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The  major  goal  of  this  dissertation  is  to  present  a new  paradigm  for  the  unsupervised 
competitive  segmentation  and  modeling  of  signals  into  stationary  segments,  and  to  eluci- 
date and  develop  several  specific  algorithms  that  fall  within  the  paradigm.  The  identifica- 
tion of  the  segments  is  accomplished  by  several  experts  which  act  in  parallel  on  a local 
section  of  the  signal  at  a time.  The  experts  can  utilize  any  unsupervised  learning  algorithm 
to  which  one  can  ascribe  a performance  measure,  the  two  most  common  of  which  are  the 
tasks  of  prediction  and  auto-association.  The  experts  learn  in  the  usual  way  except  that  the 
local  data  are  weighted  based  on  how  well  the  experts  have  performed  in  the  past.  This 
information  is  provided  by  the  gate,  which  analyzes  the  performance  of  the  experts  and 
then  ascribes  to  each  a relative  measure  of  the  validity  of  each  expert  to  that  part  of  the 
data.  The  cumulative  effect  of  these  gate  assignments  is  to  segment  the  data. 
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A gate  can  be  either  input  or  output  based.  An  input  based  gate  attempts  to  learn  to 
predict  the  validity  of  the  experts  from  the  input  only,  thus  eventually  eliminating  the  need 
for  performance  information,  while  an  output  based  gate  always  analyzes  the  performance 
of  the  experts. 

We  use  principal  component  analysis  (PCA)  experts,  which  requires  finding  the  proba- 
bility density  function  for  auto-association,  and  then  show  that  the  resulting  learning  equa- 
tions closely  parallel  those  of  a single  PCA  expert. 

When  applied  to  time  series,  PCA  becomes  temporal  PCA.  We  show  that  the  discrete 
prolate  spheriod  wave  functions  form  a natural  basis  for  the  eigenfunctions  in  the  fre- 
quency domain,  and  that  this  basis  explains  most  of  the  properties  of  temporal  PCA. 

Finally,  we  show  that  momentum  learning  is  equivalent  to  a recursive  mean  square 
error  estimator,  and  then  use  this  estimator  to  add  memory  to  the  gate,  which  improves 
segmentation  and  learning  by  giving  greater  validity  to  experts  that  have  performed  well  in 
the  recent  past. 
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CHAPTER  1 
INTRODUCTION 

1.1  Goal 

The  goal  of  this  dissertation  is  to  develop  a new  class  of  algorithms  for  non-stationary 
time  series  and  image  analysis  through  the  segmentation  of  data  into  stationary  regions 
and  modeling  of  the  resulting  segments,  all  in  a completely  unsupervised  fashion.  One 
must  always  make  some  assumptions  about  the  data  with  which  one  is  working.  Here,  we 
assume  that  the  data  are  piecewise  stationary,  with  switching  between  dynamical  systems 
occurring  rapidly,  but  where  the  switching  rate  itself  occurs  at  a much  slower  rate.  The 
dynamical  systems  which  produced  the  data  are  unknown,  as  are  the  switching  times. 

We  need  to  give  more  precise  definitions  of  some  of  the  terms  used,  but  first  it  is 
instructive  to  consider  some  real  data  that  can  be  regarded  as  at  least  piecewise  quasi-sta- 
tionary.  At  a small  enough  time  scale  (5-100  msec)  speech  can  be  roughly  regarded  as  a 
sequence  of  quasi-stationary  segments.  These  segments  can  be  considered  quasi-stationary 
because,  although  slowly  evolving,  they  have  been  fairly  successfully  modeled,  as  part  of 
a larger  speech  recognition  system,  by  simple  linear  auto-regressive  (AR)  systems  with 
constant  parameters,  driven  either  by  a pulse  train,  in  the  case  of  voiced  speech,  or  noise, 
in  the  case  of  unvoiced  speech.  Although  the  primary  focus  of  this  dissertation  is  time 
series,  most  of  what  will  be  presented  is  equally  applicable  to  images.  Some  textures 
within  an  image  can  be  regarded  as  stationary  in  the  sense  that,  on  a certain  scale,  the  local 
intensity  correlations  between  neighboring  pixels  are  relatively  constant  over  the  entire 
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texture.  An  image  is  considered  piecewise  stationary  if  the  transitions  between  textures 
occur  at  well  defined  boundaries. 


1 .2  Motivation 

There  are  many  applications  of  segmentation  and  identification  of  signals.  In  monitor- 
ing any  process,  such  as  the  operation  of  a jet  or  car  engine,  or  a machine  tool,  there  may 
be  several  stationary  regimes,  some  of  which  represent  in-process  states,  and  others  which 
represent  out-of-process  states.  The  detection  of  a change  in  stationarity  can  be  a possible 
indication  of  an  out-of-process  state,  and  identification  of  the  new  regime  can  help  deter- 
mine this,  and  thus  whether  additional  steps  need  to  be  taken. 

In  the  area  of  control,  a plant  may  have  several  stationary  regimes,  each  of  which 
requires  a specially  tuned  controller.  Detection  of  a change  in  the  plant  and  identification 
of  the  new  regime  can  be  used  to  switch  to  the  appropriate  controller. 

In  some  cases,  identification  of  the  stationary  regime  is  the  goal  itself.  In  the  case  of 
speech  recognition,  identification  of  the  quasi-stationary  segments  can  be  passed  on  to  a 
higher  level  processor  for  word  or  sub- word  recognition. 

1.3  Outline 

In  the  remainder  of  this  chapter,  we  will  define  some  of  the  terms  and  concepts  laid  out 
in  the  goal.  In  chapter  two  we  will  examine  the  classical  approaches  to  segmentation  and 
modeling.  In  chapter  three  we  will  present  a unified  framework  for  unsupervised  gated 
competitive  systems,  under  which  fall  all  the  new  algorithms  presented  in  this  dissertation, 
and  show  that  some  recent  algorithms  in  the  literature  also  fall  under  the  same  framework. 
In  chapter  four  we  will  examine  gated  competitive  principal  component  analysis  and  apply 
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it  to  image  segmentation.  In  chapter  five  we  will  apply  gated  competitive  principal  compo- 
nent analysis  to  time  series,  and  then  attempt  to  understand  the  results  in  terms  of  special 
functions  called  the  discrete  prolate  spheriod  wave  functions.  In  chapter  six  we  will  look  at 
various  ways  of  adding  memory  to  gated  competitive  systems,  as  well  as  novel  gate  struc- 
tures. Finally,  in  chapter  seven  we  will  draw  conclusions  on  the  present  work  and  suggest 
possible  lines  of  inquiry  for  future  work. 

1 .4  Stationaritv 

We  now  consider  a more  technical  definition  of  a stationary  signal.  To  do  so,  we  must 
first  review  the  concept  of  a random  process.  Imagine  sampling  some  property  of  a 
dynamical  system  over  a period  of  time,  so  that  we  have  a series  of  measurements,  x(tj). 
Now,  imagine  that  somehow  we  can  reset  the  dynamical  system  and  repeat  the  measure- 
ments. If  the  results  are  different,  then  the  process  is  random  and  not  deterministic.  A good 
example  of  this  is  the  many  ways  a single  speaker  can  say  the  same  word.  Because  of 
changes  in  stress,  co-articulation,  duration,  and  physical  condition,  no  speaker  ever  says 
the  same  word  in  exactly  the  same  way.  If  we  repeat  the  experiment  an  infinite  number  of 
times,  we  can  find  the  first  order  probability  density  function  (pdf)  of  the  signal  at  each 
sampled  time  instant,  p(x{tj))  Vi.  Now,  x(tt)  can  be  regarded  as  a random  variable.  A first 
order  stationary  process  is  one  whose  first  order  pdf  is  the  same  for  all  given  sampled 
times:  p(x(tj))  = p(x(tj))  Note  that  the  term  “first  order  stationarity”  refers  to  the  num- 
ber of  random  variables  only;  if  the  pdf’s  are  the  same,  all  higher  order  moments  must  also 
be  the  same:  £[xa(t-)]  = E[xa(tj)],  Vi,y. 

First  order  stationarity  does  not,  however,  say  anything  about  the  relationships  across 
time  between  the  random  variables.  For  that,  we  need  the  second  order  or  joint  pdf 
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between  any  two  sampled  time  instants , pix(tj)jc(tj)\  V/j.  A second  order  stationary’  pro- 
cess is  one  where  the  joint  pdf  between  two  sampled  time  instants  depends  only  on  the 
time  difference  between  them.  That  is,  the  joint  pdf  remains  the  same  for  all  constant  time 
differences:  p(x(t$Atj))  = P(x(tm)XQ)  when  tt  - t}  = tm  - t„.  Second  order  stationarity 
implies  that  all  joint  moments  will  only  be  a function  of  the  time  difference: 
£[/(/, .)At,)]  =f(ti-tj,  a,  b ),  ViJ. 

The  strictest  stationarity  requires  that  all  higher  order  pdf’s  be  invariant  to  uniform 
time  shifts  of  the  random  variables.  However,  estimating  joint  pdf’s  from  data  is  subject  to 
the  so-called  dimensionality  explosion,  requiring  an  exponentially  increasing  amount  of 
data  with  each  increase  in  pdf  order,  thus  rendering  impractical  higher  order  stationarity 
tests.  Even  estimating  second  order  pdf’s  is  difficult.  For  this  reason,  a lesser  requirement 
known  as  wide  sense  stationarity  is  often  employed,  for  which  it  is  only  necessary  that  the 
first  moments  of  the  first  and  second  order  pdf’s  be  independent  of  time  shifts: 
£[*(*,)]  = E[x{tj)]  and  £[x(/,)jc(/y)]  = /(7,-ij),  Vi J.  Note  that  the  latter  equation 
implies  that  the  second  moment  of  the  first  order  pdf  will  also  be  independent  of  time: 
E[x  (t,)]  = E[x  (*.)],  V/,y.  Because  Gaussian  processes  are  completely  characterized  by 
their  first  and  second  moments,  a wide  sense  stationary  Gaussian  process  is  also  second 
order  stationary. 


1.5  Ergodicitv 

There  is  another,  more  fundamental  problem  with  these  definitions  of  stationarity. 
They  often  cannot  be  tested  in  practice  because  many  physical  dynamical  systems  are 
uncontrolled  free-running  systems  which  cannot  be  reset,  and  thus  there  exists  only  one 
realization  of  the  random  process.  We  then  have  to  hope  that  the  time  average  statistics  of 
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the  single  realization  are  equivalent  to  the  process  average  statistics,  in  which  case  the  pro- 
cess is  called  ergodic.  Note  that  an  ergodic  process  is  stationary  but  a stationary  process  is 
not  necessarily  ergodic. 

If  we  have  a single  realization  of  a first  order  ergodic  process  sampled  at  discrete  time 
intervals,  we  can  then  estimate  both  the  first  order  pdf  p(x ) and  second  order  joint  pdf, 
p(x(ji)jc{n+m)),  with  the  same  confidence  as  for  multiple  realizations,  providing  we  sam- 
ple the  data  for  a sufficiently  long  time.  Joint  moments,  due  to  the  implied  second  order 
stationarity,  are  a function  of  the  discrete  lag  only:  E[xa(n)xb(n  + m )]  = f(m,  a,  b ) . The 
most  common  joint  moment  is  the  autocorrelation:  r{m ) = E[x(n)x(n  + m )] . Note  that 
we  are  often  only  interested  in  the  correlations,  and  thus  estimate  them  directly  from  the 
data,  bypassing  the  problem  of  estimating  the  pdf’s. 

To  be  fair,  there  are  cases  where  we  can  repeat  an  experiment  and  the  ergodic  assump- 
tion is  not  required.  Certainly  a speaker  can  repeat  a word  many  times.  However,  even  in 
such  cases  there  is  the  problem  of  time  alignment  and  warping.  For  example,  without 
proper  time  alignment,  the  fact  that  a word  is  composed  of  quasi-stationary  segments 
might  be  completely  obliterated  in  repeated  experiments.  Furthermore,  the  large  number 
of  trials  required  for  statistical  significance  is  often  impractical. 

Finally,  a note  about  human  perception  of  stationarity,  again  using  speech  as  an  exam- 
ple. It  is  worth  noting  that  a non-stationary  signal  may  not  necessarily  be  interpreted  as 
such  by  humans.  As  an  example,  a change  in  volume  in  the  middle  of  an  otherwise  quasi- 
stationary  segment  of  speech  is  readily  identified  as  conveying  the  same  information,  but 
in  fact  creates  a non-stationary  signal.  This  change  in  amplitude  has  increased  the  variance 
of  the  pdf  but  has  not  changed  the  generic  shape  of  the  pdf  itself. 
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1 .6  Piecewise  Stationaritv 

Having  defined  stationarity,  it  is  now  natural  to  define  piecewise  stationarity.  Gener- 
ally, there  are  two  frameworks  that  describe  piecewise  stationary  signals.  Common  to  both 
are  a generic  state  space  representation  of  a dynamic  model  described  by  a parameter  set, 
0,  and  an  internal  state  a:.  The  model  is  driven  by  an  excitation  source,  u(n).  A change  in 
either  the  parameter  set  or  the  excitation  source  may  create  a new  stationary  regime. 

The  first  framework  is  called  the  external  switching  framework  and  is  diagrammed  in 
Figure  1-1  (a).  At  a well  defined  transition  time,  the  switch  connects  to  a different 
dynamic  model,  producing  a new  stationary  regime.  The  second  is  called  the  parameter 
switching  framework  and  is  diagrammed  in  Figure  1-1  (b).  At  the  transition  time,  one 
parameter  set  or  excitation  is  simply  replaced  by  another. 


Figure  1-1.  Piecewise  stationary  models:  (a)  parameter  switching;  (b)  external  switching. 
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The  primary  difference  between  these  two  models  are  the  transient  effects.  If  the 
dynamic  model  has  memory,  then  the  parameter  switching  model  will  exhibit  transient 
effects  for  the  effective  memory  depth  of  the  system.  The  external  switching  model,  on  the 
other  hand,  will  never  exhibit  transient  effects.  Clearly  the  external  switching  model  is  an 
idealization. 

There  is  another  class  of  signals  that  are  called  quasi-stationary  signals.  In  these  sig- 
nals the  transition  between  regimes  occurs  gradually  rather  than  instantaneously.  The 
external  switching  model  can  be  altered  to  produce  a quasi-stationary  signal  by  replacing 
the  switch  with  a gradually  changing  mixture  during  the  transition  period.  The  parameter 
switching  model  can  be  altered  to  allow  the  dynamic  model  parameters  to  gradually 
evolve  over  time  from  one  set  to  another.  In  fact,  an  adiabatic  transition  between  different 
parameter  sets  will  generally  exhibit  less  severe  transient  effects  than  a quantum  switch. 

Using  speech  as  an  example  again,  in  a transition  between  two  voiced  speech  segments 
the  excitation  remains  nearly  the  same,  but  the  shape  of  the  vocal  tract  changes,  resulting 
in  a continual  change  in  the  parameter  set  of  the  corresponding  dynamic  model. 

1 .7  Tracking  vs.  Dividing  a Piecewise  Stationary  Signal 

The  question  now  becomes,  given  our  goal  of  segmenting  and  modeling  piecewise  sta- 
tionary signals,  how  can  we  accomplish  this?  When  confronted  with  a non-stationary  sig- 
nal that  changes  its  2nd  order  or  higher  statistics,  the  normal  approach  is  to  employ  a single 
adaptive  system.  An  adaptive  system  has  continually  adjusting  parameters  that  respond  to 
the  local  (in  time)  properties  of  the  signal.  In  order  to  be  local  in  time,  an  adaptive  system 
must  have  a finite  effective  memory.  That  is,  it  must  eventually  forget  what  it  has  learned 
in  the  past.  The  effective  memory  of  an  adaptive  system  may  be  governed  by  a learning 
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rate,  for  example,  in  the  case  of  the  least  means  square  (LMS)  filter,  or  a “forgetting  fac- 
tor”, in  the  case  of  a recursive  least  squares  (RLS)  filter.  However,  this  finite  memory  cre- 
ates a classic  trade-off  between  the  speed  of  convergence  and  the  misadjustment  after 
convergence.  Misadjustment  describes  the  variance  of  the  parameters  about  their  nominal 
solution.  In  terms  of  the  segmentation  and  modeling  problem,  this  translates  into  a trade- 
off between  the  accuracy  of  the  modeling  and  the  delay  in  detecting  a new  stationary 
regime.  If  the  effective  memory  is  too  short,  the  parameters  exhibit  a large  variance  about 
the  mean,  which  can  even  masquerade  as  a transition.  If  the  effective  memory  is  too  long, 
transition  times  become  uncertain  or  worse,  short  stationary  regions  may  be  missed 
entirely. 

There  is,  however,  a more  fundamental  limitation  in  using  a single  adaptive  system  to 
monitor  a non-stationary  signal.  The  normal  procedure  for  detecting  a change  involves 
monitoring  the  innovations  of  the  adaptive  system  for  deviation  from  whiteness,  or  a 
change  in  variance.  The  innovations  are  the  part  of  the  signal  that  cannot  be  explained  by 
the  model.  Unfortunately,  it  can  be  shown  that  the  sequence  of  innovations  of  a single  sys- 
tem is  an  insufficient  statistic  (see  Basseville  and  Nikiforov,  1993).  That  is,  there  is  a many 
to  one  mapping  from  different  stationary  regimes  to  identical  statistics  of  the  innovations. 
For  this  reason,  two  or  more  adaptive  models  are  required,  and  this  forms  the  basis  of  the 
classical  methods  of  segmentation  to  be  presented  in  chapter  two. 

The  classical  methods  are  sequential  in  nature,  in  that  information  from  prior  station- 
ary regions  is  discarded.  Under  the  production  models  of  Figure  1-1,  such  methods  are 
appropriate  whenever  the  potential  number  of  dynamical  models  is  very  large,  so  that 
there  is  a small  probability  of  returning  to  the  same  model.  In  this  case,  it  is  appropriate  to 
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view  the  data  as  a sequence  of  stationary  regions , and  the  segmentation  problem  becomes 
one  of  detecting  the  transition  between  two  stationary  regions. 

There  is  another,  fundamentally  different,  approach  to  dealing  with  piecewise  station- 
arity.  Instead  of  viewing  the  data  as  a sequence  of  stationary  regions,  we  model  the  data 
under  the  assumption  that  it  has  been  produced  by  switching  among  a finite  set  of  station- 
ary regimes.  This  approach  is  only  feasible  when  the  number  of  potential  regimes  is  rea- 
sonably small. 

Modeling  the  regime  instead  of  the  region  has  a number  of  advantages.  First  and  fore- 
most, if  several  discontiguous  regions  belong  to  the  same  regime,  information  from  all 
these  regions  can  be  used  to  improve  the  model  estimate.  Second,  a segmentation  system 
that  models  the  regimes  can  generalize  to  new  piecewise  stationary  processes  as  long  as 
they  are  generated  by  the  same  switching  model. 

The  development  of  various  algorithms  that  segment  a signal  by  modeling  the  regimes 
forms  the  fundamental  topic  of  this  dissertation. 


CHAPTER  2 

CLASSICAL  METHODS  OF  SEGMENTATION 
2.1  Classical  Methods  of  Segmentation 

Consider  the  following  problem:  given  a single  realization  of  a piecewise  ergodic  ran- 
dom process,  consisting  of  a finite  ordered  set  of  vectors,  segment  the  data  into  contiguous 
stationary  regions.  This  is  known  as  the  off-line  segmentation  problem.  We  do  not  know 
how  many  stationary  regions  are  contained  in  the  data,  nor  the  change  points  between 
regions.  Because  of  this,  the  problem  cannot  be  easily  formulated  as  a global  test  between 
known  hypotheses.  The  only  possible  approach  then  is  to  perform  sequential  detection  of 
change  points,  which  makes  the  off-line  segmentation  problem  similar  to  the  on-line  case, 
reducing  the  problem  to  one  of  detecting  successive  change  points  between  successive 
pairs  of  stationary  regions.  This  problem  was  first  addressed  by  Page  (1957)  but  the  algo- 
rithm we  will  now  examine  in  some  detail  was  first  presented  by  Brandt  (1982,  1983). 

In  the  absence  of  any  prior  information,  the  classical  approach  in  this  case  is  a general- 
ized likelihood  ratio  (GLR)  test  for  deciding  between  two  hypotheses:  the  null  hypothesis, 
Hq,  that  no  change  occurred  within  N samples,  and  the  posited  hypothesis,  H j,  that  a 
change  did  occur  at  the  intermediate  sample  time  T.  From  the  likelihood  ratio  of  these  two 
hypotheses,  a decision  function  is  found  that  forms  a change  detector.  When  the  decision 
function  exceeds  a preset  threshold,  a change  is  detected.  The  exact  transition  time  is  then 
determined,  and  the  algorithm  is  reset  and  started  anew. 
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2.2  Tests  for  a Change  in  First  Order  Statistics 
Consider  the  problem  of  detecting  a change  in  the  first  order  pdf  of  an  independently 
identically  distributed  (i.i.d.)  process,  assuming  knowledge  of  the  parametric  form  of  the 
pdf,  Pq(x),  characterized  by  the  parameter  set  0.  Since  we  are  interested  in  an  unsupervised 
algorithm,  we  do  not  know  the  parameters  of  the  pdf  before  the  change  point,  0O,  or  after 
the  change  point,  0j,  nor  the  change  time,  T.  The  likelihood  of  the  null  hypothesis,  Hq,  that 
no  change  occurred  within  N samples  is 

N 

l(ho)  = n *.«»»•  (2,i) 

n = 1 

Likewise,  the  likelihood  of  the  hypothesis  Hx  that  a change  in  the  pdf  occurred  at  time 
T is  given  by 


L{HX)  = 


■T-  1 
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Note  that  we  differentiate  between  the  parameter  set  under  the  null  hypothesis,  0O,  and  the 
parameter  set  before  change  under  the  change  hypothesis,  0'o,  since  they  are  estimated 
over  different  times.  The  log-likelihood  ratio  between  the  two  hypotheses  is  then 
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which  can  also  be  written 
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The  decision  function  is  then  formed  by  replacing  the  parameters  0O,  9'0,  and  0j,  in 
by  their  maximum  likelihood  (ML)  estimates,  0O,  0'o,  and  0i , over  their  respective 
regions  and  then  maximizing  that  with  respect  to  the  change  time  T 


g(N)  = max 


r [< 


Go,  0'o,  0i 


] = 


max  rmax0(  maxQ^min 


(2.5) 


A change  is  deemed  to  have  occurred  within  the  block  of  N data  if  the  decision  func- 
tion exceeds  some  predefined  threshold 


g(A 0 


(2.6) 


From  (2.4)  and  (2.5),  we  see  that  under  H0,  the  decision  function  tends  to  fluctuate  near 
zero,  and  under  Hh  the  decision  function  increases  with  increasing  N. 

To  summarize,  the  algorithm  starts  from  the  last  detected  change  point,  from  which  a 
small  block  of  A points  is  examined.  To  find  g(N),  the  two  hypotheses  are  compared  using 
(2.4)  for  all  possible  changepoints,  T,  within  the  block,  and  the  maximum  value  of  is 
assigned  to  g(N).  If  g(N)  is  less  than  the  threshold,  no  changepoint  is  detected,  N is  incre- 
mented, and  the  process  repeated.  If  g(N)  is  greater  than  the  threshold,  the  algorithm  is 
reset  and  started  anew.  A symbolic  diagram  of  this  is  shown  in  Figure  2-1  (a).  Regions 
where  the  number  of  samples  is  effected  by  T are  shown  with  solid  arrows,  while  regions 
effected  by  N are  shown  with  hollow  arrows. 

As  an  example,  consider  the  problem  of  detecting  a change  in  the  mean,  p,  of  an  inde- 
pendent identically  distributed  (i.i.d.)  Gaussian  process  under  the  assumption  of  constant 
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Figure  2-1.  Classical  change  point  detection:  (a)  three  model;  (b)  two  model;  (c)  three 
model  with  fixed  length  region  for  0^  (d)  two  model  with  fixed  length  region  for  0]. 

variance.  The  maximization  with  respect  to  the  parameters  in  (2.5)  is  explicit,  resulting  in 

the  following  decision  function 


g(JV)  = maxr[(7--l)m'0)2  + (JV-r+l)(nl)2-JV(t.0)2]. 


(2.7) 


Using  the  relationship  between  the  means,  fVp0  = (T—  l)|f0  + (N—  T+  l)pj , this  can 
be  further  simplified  to 


g(N)  = maxr 


N(N-T+l),  ,2' 
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(2.8) 
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Note  that  the  relationship  between  the  model  parameters  allowed  reformulation  of  the 
three  model  decision  function  in  terms  of  two  models. 


2.3  Computational  Complexity 

This  algorithm  is  computationally  intensive.  For  each  block  of  N points,  on  the  order 
of  N+2  new  ML  estimates  must  be  performed.  Extensive  use  of  recursive  ML  estimates 
can  significantly  reduce  the  computational  load.  However,  (2.4)  must  still  be  evaluated  N 
times.  A minor  simplification  can  be  made  by  assuming  that  the  null  maximum  likelihood 
estimates  are  approximately  equal,  9'o  « 0o,  resulting  in  the  two  model  GLR  algorithm 
shown  in  Figure  2-1  (b).  The  first  term  in  (2.4)  is  then  zero,  and  the  whole  equation  simpli- 
fies to 
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<=  I 'og 
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Pa  (*(»)) 

. Uo 


(2.9) 


Note  that  0o  is  still  estimated  over  all  N samples  but  the  GLR  is  only  evaluated  over  the 
last  N-T+\  samples. 

A more  dramatic  reduction  can  be  achieved  by  eliminating  the  maximization  over  T in 
(2.5)  by  fixing  the  length,  N-T+ 1,  of  the  data  segment  associated  with  the  parameter  set  0j. 
This  results  in  the  symbolic  diagram  shown  in  Figure  2-1  (c).  Finally,  these  two  simplifi- 
cations can  be  combined,  resulting  in  the  symbolic  diagram  of  Figure  2-1  (d).  Eliminating 
the  maximization  over  T also  allows  for  on-line  recursive  estimation  of  the  parameters. 


2.4  Estimation  of  the  Parameters 

Having  detected  a change,  the  problem  is  then  transformed  into  one  of  estimation  of 
the  parameter  sets  and  change  point.  Given  that  a change  was  detected  in  a sub-block  of 
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data  of  size  N,  the  problem  is  to  find  the  ML  estimates  of  the  unknown  parameters  sets  and 
the  change  time 


-T- 1 n y 

(T,  0o,ei)  = max^^ 

log 

n **«»>> 

.n  = 1 n = T 

(2.10) 


Again  replacing  the  parameter  sets  by  their  maximum  likelihood  estimates  over  their 
respective  regions,  this  can  be  simplified  to 
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(2.11) 


Note  that  the  first  term  does  not  depend  on  T,  and  thus  can  be  ignored.  What  remains  is 
identical  to  the  decision  function,  g(N),  for  the  two  model  case.  This  means  that  the  esti- 
mation of  the  parameters  can  be  done  in  conjunction  with  the  detection. 

The  close  relationship  between  detection  and  estimation  also  applies  to  the  on-line  and 
off-line  implementations.  In  the  on-line  case  the  algorithm  is  usually  re-started  from  the 
point  of  detection , which  inevitably  involves  a delay  from  the  changepoint.  In  the  off-line 
case,  the  algorithm  is  usually  re-started  from  the  ML  estimate  of  the  changepoint. 


2.5  Tests  for  a Change  in  Higher  Order  Statistics 
The  previous  formulation  does  not  work  when  trying  to  find  change  points  in  a higher 
order  joint  pdf  because  the  process  is  no  longer  i.i.d.  However,  if  we  can  find  an  inverse 
model  for  each  of  the  stationary  regimes,  the  output  of  the  inverse  filter,  the  so  called  inno- 
vations, is  i.i.d.  Then,  all  of  the  previous  development  still  holds  and  the  problem  becomes 
one  of  detecting  a change  in  the  pdf  of  the  innovations.  The  off-line  approach  to  detecting 
a change  in  an  ARMA  process  was  first  investigated  by  Kligiene  and  Telksnys  (1983). 
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When  the  model  that  produced  the  signal  is  known,  the  inverse  model  can  be  found 
analytically.  However,  in  most  cases,  we  do  not  know  the  original  model.  In  this  case,  the 
inverse  model  can  be  found  from  the  data  and  an  adaptable  model  that  learns  to  predict  the 
next  value,  as  shown  in  Figure  2-2.  If  the  parameter  set  is  adjusted  so  that  the  innovations 
are  completely  unpredictable,  then  all  possible  information  in  the  data  has  been  extracted. 
Then,  the  overall  transfer  function  from  input  to  residuals  must  be  the  inverse  of  the  trans- 
fer function  of  the  model  that  produced  the  data.  Note  that  if  the  excitation  of  the  model 
that  produced  the  data  has  a predictable  component,  that  will  be  modeled  as  well  by  the 
predictor. 


Figure  2-2.  Using  prediction  to  convert  a higher  order  process  to  an  i.i.d.  process. 


We  now  consider  the  problem  of  detecting  a change  in  the  residuals  when  their  pdf  is 
Gaussian.  It  is  easy  to  show  that  the  decision  function  in  the  three  model  case  is 

g(JV)  = maxr[Mog[CTo]  — (T—  l)log[a'o]  — (jV— T+  l)log[cn]] 
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where  cto  , a’o , and  cti  , are  the  variances  of  the  residuals  of  the  models  over  their  respec- 
tive regions.  Note  that,  unlike  the  first  order  case  where  there  is  usually  a simple  relation- 
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ship  between  the  parameters,  there  is  no  simple  way  to  relate  the  model  variances  to  each 
other.  As  in  the  first  order  pdf  test,  the  decision  function  tends  to  fluctuate  slightly  above 
zero  until  a change  is  detected,  at  which  point  it  rapidly  increases.  Also,  the  same  simplifi- 
cations can  be  made  to  reduce  the  computational  load. 

The  decision  function  in  (2.12)  is  exactly  the  same  formulation  for  detecting  a change 
in  the  variance  of  a first  order  Gaussian  pdf.  Thus,  this  test  cannot  distinguish  between  a 
change  in  the  signal  power  and  a change  in  the  model.  Rather  than  use  (2.12)  and  perform 
additional  tests  to  determine  the  type  of  change,  some  authors  have  suggested  decision 
functions  that  directly  incorporate  other  distance  measures  between  models.  In  the  case  of 
ARMA  modeling,  the  L2  norm  between  the  spectrums  can  be  used.  For  the  more  general 
case,  various  entropy  measures  between  the  pdf’s  can  be  employed. 

2.6  Simulations 

We  now  present  results  of  the  three  model  GLR  segmentation  algorithm  for  many  of 
the  time  series  to  be  presented  later  in  this  dissertation.  For  each  figure  on  the  following 
pages,  the  top  panel  shows  the  original  time  series  along  with  vertical  lines  representing 
the  known  changepoints,  as  determined  either  by  design  or  human  expert  scoring.  The  bot- 
tom panel  shows  the  decision  function,  with  vertical  lines  representing  the  detected 
changepoints.  Each  figure  caption  includes  information  on  the  order  (of  the  AR  predic- 
tors), threshold  (for  detection),  and  deadzone  (the  number  of  samples  added  after  a detec- 
tion). In  addition,  some  of  the  longer  time  series  have  an  increment  (the  number  of 
samples  added  after  no  detection).  After  a detection,  the  algorithm  was  reset  from  the 
point  of  detection. 


18 


Figure  2-3.  Switching  FIR  process:  (a)  time  series  and  segmentation;  (b)  decision  function 
and  segmentation  (order  = 8,  threshold  = 30,  deadzone  = 50). 


Figure  2-4.  Noisy  chirp  signal:  (a)  time  series;  (b)  decision  function  and  segmentation 
(order  = 8,  threshold  = 15,  deadzone  = 50). 
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Figure  2-5.  Earthquake  seismogram:  (a)  time  series;  (b)  decision  function  and  segmenta- 
tion (order  = 8,  threshold  = 120,  deadzone  = 20,  increment  = 10). 


Figure  2-6.  Speech  signal:  (a)  time  series  and  segmentation;  (b)  decision  function  and  seg- 
mentation (order  = 12,  threshold  = 400,  deadzone  = 50,  increment  = 100). 
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Figure  2-7.  Violin:  (a)  time  series  and  segmentation;  (b)  decision  function  and  segmenta- 
tion (order  = 8,  threshold  = 100,  deadzone  = 100,  increment  = 100);  (c)  noisy  time  series; 
(d)  decision  function  and  segmentation  (order  = 8,  threshold  = 30,  deadzone  = 100,  incre- 
ment = 1 00). 


CHAPTER  3 

A FRAMEWORK  FOR  UNSUPERVISED  GATED  COMPETITIVE  SYSTEMS 


3.1  Introduction 

We  saw  in  the  previous  chapter  how  classical  likelihood  ratio  techniques  can  be  used 
to  segment  and  model  a piecewise  stationary  signal  in  an  unsupervised  fashion.  In  the 
absence  of  apriori  information  about  the  number  of  stationary  regimes,  the  classical  tech- 
niques are  necessarily  sequential  in  nature.  That  is,  after  the  last  detected  change  point, 
they  concentrate  entirely  on  the  current  stationary  region,  continually  testing  for  a transi- 
tion to  a potentially  new  stationary  region,  and  do  not  use  any  information  from  prior 
regions. 

We  now  consider  the  case  where  the  stationary  regions  are  due  to  switching  between  a 
finite  set  of  unknown  generative  models.  We  still  consider  the  piecewise  stationary  case, 
where  only  one  model  is  active  at  a given  time,  but  now  a single  model  can  be  repeatedly 
active  at  different  periods  in  time.  Therefore,  in  designing  algorithms  for  this  case,  instead 
of  concentrating  on  stationary  regions,  we  need  to  concentrate  on  the  stationary  regimes 
that  produced  the  data. 

To  this  end,  we  now  introduce  the  competitive  multiple  model  approach,  where  several 
adaptable  “experts”  compete  to  explain  the  same  data.  The  most  successful  experts  are 
granted  larger  parameter  updates.  As  an  expert  begins  to  specialize  on  a particular  station- 
ary region,  it  is  then  at  a disadvantage  for  other  stationary  regions.  Such  an  approach  has 
two  key  components:  an  evaluation  mechanism,  through  which  the  relative  performance  of 
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the  experts  can  be  measured,  and  a competitive  framework,  which  governs  the  competition 
between  experts  by  comparing  the  experts’  performance  and  relating  that  to  their  adapta- 
tion to  the  data.  The  competitive  framework  is  implemented  as  a gate,  and  the  whole  archi- 
tecture called  “gated  experts”.  It  is  the  role  of  the  gate  to  determine  which  expert  is  valid  at 
the  present  time.  The  history  of  the  gate’s  decisions  effectively  segments  the  data.  It  is  the 
role  of  the  experts  to  learn  when  the  gate  instructs  them  to  do  so,  and  when  they  are  not 
learning,  to  remember  the  modeling  of  regimes  they  have  learned  in  the  past. 


3.2  Gated  Competition  of  Experts 

In  the  activation  phase,  the  total  system  output  is  a weighted  sum  of  the  experts’  out- 
puts 


K 

y ~ X Si<yk(x)  (31) 

k=  l 

where  yk  is  the  output  of  the  kth  expert  with  x as  input,  and  gk  is  the  kth  output  of  the  gate. 
In  order  for  the  mixture  to  be  meaningful,  the  gate  output  is  usually  constrained  to  sum  to 
one 


K 

X gk  = 1 • (3-2) 

k=  l 

Such  a linear  mixture  of  experts  can  represent  either  a competitive  or  cooperative  sys- 
tem, depending  on  how  the  system  learns,  as  specified  by  the  cost  function.  In  the  context 
of  introducing  their  Mixture  of  Experts  model,  Jacobs  et  al.  (1991)  first  presented  a cost 
function  that  encourages  competition  among  gated  expert  networks,  which  we  generalize 
to 
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K 

C = X Skf(d-yk(x ))  (3-3) 

k=  1 

where  d is  the  desired  signal.  The  evaluation  mechanism  is  embodied  in  the  term 
f(d—yk ) , a function  of  the  performance  or  error  of  the  kth  expert.  Since  the  desired  signal 
is  the  same  for  all  experts,  they  all  try  to  regress  the  same  data,  and  are  always  in  competi- 
tion. This  alone,  however,  is  not  enough  to  foster  specialization.  The  competitive  frame- 
work  is  embodied  in  the  gate  gk,  which  represents  some  measure  of  the  k expert  being 
the  correct  one. 

Typical  functions  forfQ  include  the  squared  error,  in  which  case  the  total  cost  function 
over  a data  set  of  N samples  is  the  sum  of  the  instantaneous  costs 

N K 

c = £ ^gk(n)[d(n)-yk(x(n))]2  (3.4) 

n=\k=  I 

or  a probability  density  function  in  the  error  which,  if  i.i.d.,  results  in  the  total  cost  func- 
tion as  a product  of  the  instantaneous  costs 

N K 

c=  n Yj gk(n)p(d(n)~yk(x(n)))-  (3-5) 

n = \k=  1 

The  formalism  represented  by  (3.3)  is  a supervised  algorithm,  in  that  it  requires  a 
desired  signal.  However,  we  are  interested  in  a completely  unsupervised  algorithm  for  per- 
forming segmentation  and  modeling.  A supervised  algorithm  becomes  unsupervised  when 
the  desired  signal  is  a fixed  transformation  of  the  input.  Although  many  transformations 
are  possible,  the  two  most  common  transformations  involve  the  delay  operator  and  the 
identity  matrix,  resulting  in  prediction  and  autoassociation , respectively.  Although  not 
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always  explicitly  stated,  throughout  this  work  it  is  to  be  understood  that  the  indication  of  a 
desired  signal  implies  some  transformation  of  the  input  d =>  d(x ) . 

3.3  Input  vs.  Output  Based  Gating 

All  the  algorithms  embodied  in  (3.3)  fall  into  two  broad  categories,  which  we  desig- 
nate as  input  or  output  based  gating,  and  are  shown  in  Figure  3-1 . For  output  based  gating, 
the  gate  is  a direct  calculated  function  of  the  performance,  and  hence,  the  outputs , of  the 
experts.  The  simplest  scheme  for  the  gate  that  falls  within  this  category  is  hard  competi- 
tion, for  which  the  gate  chooses  the  expert  with  the  smallest  magnitude  error  in  a winner 
take  all  fashion.  Other  output  based  gates  possess  memory  to  keep  track  of  past  expert  per- 
formance. Examples  of  algorithms  that  fall  into  this  category  include  the  Annealed  Com- 
petition of  Experts  (ACE)  of  Muller  et  al.  (1995),  Xu’s  (1994)  algorithm,  and  Self- 
Annealing  Competitive  Prediction  of  Fancourt  and  Principe  (1997a),  which  we  will 
present  in  chapter  six.  A more  elaborate  gate  with  both  memory  and  a spatial  structure  is 
used  in  the  Neighborhood  Map  of  Competing  Predictors  of  Fancourt  and  Principe  (1996a), 
also  to  be  presented  in  chapter  six. 

With  input  based  gating,  the  gate  is  an  adaptable  function  of  the  input  that  learns  to 
forecast  which  expert  will  perform  the  best.  The  best  example  of  input  based  gating  is  the 
Mixture  of  Experts  (MOE)  algorithm. 

3.4  Annealed  Competition  of  Experts 

Perhaps  the  best  known  example  of  output  based  gating  is  the  Annealed  Competition 
of  Experts  (ACE)  algorithm  of  Muller  et  al.  (1995).  A set  of  K expert  neural  predictors 
operate  in  parallel,  and  a gating  function  determines  the  probabilities  of  the  experts.  The 
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Figure  3-1.  Two  classes  of  gates  during  activation:  (a)  input  based;  (b)  output  based, 
total  cost  function  is  based  on  (3.4)  which  uses  the  standard  squared  error  as  the  evaluation 
mechanism.  The  competitive  framework  provided  by  the  gate  is  a Gaussian  function  of  the 
local  mean  square  error 

gk(n)  = Ae  (3.6) 

where  A is  a normalization  constant,  P is  an  annealing  term  which  determines  the  degree 
of  competition,  and  ek  is  related  to  the  local  mean  square  error  of  the  kth  expert,  calculated 
using  a non-causal  boxcar  filter 

A A 

e*(") = X = X [^(«-'0->t(«-'0]2 • (3-7) 

x = —A  x = —A 

The  gating  function  is  calculated  but  not  adapted.  Furthermore,  the  gradient  of  the 
cost  function  with  respect  to  some  weight,  wk,  in  the  kth  expert,  is  calculated  treating  the 


gate  as  a constant 
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N 

AWk  = ~r^l  = “i  X <3-8> 

K K 

n = 1 

where  r|  is  the  learning  rate.  Thus,  the  gate  effectively  represents  a weighting  term  to  the 
learning  rate  of  the  individual  experts. 

The  algorithm  must  be  annealed  during  training.  By  starting  the  learning  process  with 
a small  value  of  p,  all  the  predictors  see  the  entire  data  set  and  converge  to  an  average  of 
all  the  dynamical  systems.  Then  as  P is  increased  and  the  competition  becomes  harder, 
each  predictor  begins  to  specialize  on  a single  stationary  region  of  the  input,  which  then 
puts  it  at  a disadvantage  for  other  stationary  dynamical  regions,  allowing  other  predictors 
to  win  those  regions.  Muller  et  al.  report  “phase  transitions”,  specific  values  of  P for  which 
a dramatic  change  in  the  cost  occurs  as  the  experts  begin  to  specialize.  There  may  be  sev- 
eral such  phase  transitions  in  the  course  of  training.  As  P~ »oo,  the  competition  becomes 
hard,  and  only  the  winning  predictor  earns  the  right  to  update  its  prediction,  at  which  point 
the  experts  are  “fine  tuning”  their  modeling  of  their  respective  regimes. 

Using  non-linear  neural  predictors,  they  have  successfully  applied  this  algorithm  to  the 
segmentation  and  identification  of  switching  chaotic  maps  and  speech.  However,  the  algo- 
rithm is  still  dependent  on  manual  fine  tuning  to  determine  experimentally  the  memory 
depth  of  the  moving  average  filter,  and  the  annealing  rate. 

There  are  two  key  issues  that  are  essential  for  the  successful  application  of  the  algo- 
rithm. The  first  is  the  memory  term,  in  the  form  of  a lowpass  “boxcar”  filter,  for  the  instan- 
taneous squared  error  (3.7).  In  cases  where  there  is  a large  overlap  between  the  return 
maps  of  the  various  dynamical  regimes,  or  a small  signal  to  noise  ratio,  a single  time 
instance  is  insufficient  to  estimate  the  probabilities  of  the  experts.  This  can  result  in  rapid 
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switching  between  predictors.  However,  under  the  assumption  of  a relatively  slow  switch- 
ing rate,  a single  predictor  would  be  expected  to  win  for  a longer  time  period.  Increasing 
the  memory  depth  gives  a greater  advantage  to  the  predictor  that  has  won  in  the  recent 
past.  The  acausal  nature  of  the  filter  is  not  an  hindrance  because  of  the  off-line  nature  of 
the  algorithm,  but  ensures  that  regime  switches  indicated  by  the  gating  function  are 
aligned  with  the  data. 


3.5  Mixture  of  Experts 

The  Mixture  of  Experts  (MOE)  algorithm  is  an  example  of  input  based  gating.  We  will 
now  cover  it  in  some  detail  since  it  has  become  one  of  the  most  popular  competitive  algo- 
rithms, but  also  because  we  will  use  it  as  the  framework  for  competitive  principal  compo- 
nent analysis  in  the  next  chapter.  The  particular  version  we  will  examine  is  called  the  non- 
linear gated  experts,  after  Weigend  (1995a),  for  which  the  gate  is  a multi-layer  perceptron. 
For  the  MOE,  the  cost  function  (3.3)  is  viewed  as  a statistical  mixture  model 

K K 

p(d\x)  = P(k\x)p(d\x,  k)  = £ gk(x)p(d\x,  k ) (3.9) 

k= 1 k=  1 

where  the  kth  gate  output,  gk(x) , represents  the  apriori  probability  of  the  kth  expert  and 
p(d\x,  k)  is  the  conditional  pdf  of  the  desired  signal,  given  the  input  and  the  parameters  of 
the  kth  expert.  For  regression,  p(d\x,  k)  is  usually  modeled  as  a Gaussian  distribution  in 
the  error  between  the  desired  signal  and  the  k expert’s  output 


p(d\x,  k)  = 


vD/2|V 
(2k)  I, 


1/2 


exp 


-yk)% 


\d- 


-T*)] 


(3.10) 


where  Y.k  is  a free  covariance  parameter  to  be  determined. 
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3.5.1  Training  with  the  EM  Algorithm 

We  now  show  how  to  train  the  MOE  architecture  using  the  Expectation-Maximization 
(EM)  algorithm  in  an  off-line  fashion.  In  the  following  development,  for  ease  of  reading, 
we  leave  out  explicit  functional  dependence.  Thus,  it  is  assumed  that  gk(n)  =>  gk(x(n)) 
and  yk(n)  =>yk(x(n )) . Also,  there  is  an  implicit  iteration  index  in  all  the  following  equa- 
tions. Given  a finite  data  set  consisting  of  N input  vectors  and  N desired  vectors,  we  are 
now  in  a position  to  derive  equations  for  training  the  system.  The  free  parameters  to  be 
determined  are  the  weights  of  the  gating  and  expert  networks,  and  the  experts’  covari- 
ances. Following  Weigend  et  al.  (1995a),  the  total  likelihood  over  the  entire  data  set  is 
given  by 

N K N K 

l=  ni^*)w^)=  nnw»)^-)w-u)r  on) 

n = \k  = 1 n=  \ k=  \ 

where,  under  the  assumption  that  only  one  expert  is  valid  at  a given  time,  the  binary  indi- 
cator variable,  1^  indicating  which  expert  is  valid,  allowed  us  to  replace  the  sum  of  the 
likelihoods  over  the  experts  by  a product.  The  indicator  variable  is  “missing”,  in  the  sense 
that  we  do  not  know  apriori  which  expert  is  valid  at  any  time  step.  In  the  E step  of  the  EM 
algorithm,  for  a given  set  of  free  parameters  of  the  experts  and  gate,  the  entire  data  set  is 
evaluated,  holding  the  free  parameters  constant,  to  determine  x(n),  p(d(n)\x(n),  k ) and 
gk(n),  for  all  k and  n.  We  then  replace  the  indicator  variables,  Ik,  at  every  time  step,  by  their 
expected  value 


hk  = E[Ik\  = P(k\x,d)  = 


P(k\x)p(d\x,k) 

K 


gkP(d \x>  k) 

K 


Y P(k\x)p(d\x,  k ) Y,  SjP(d\x,  k) 

7=1  7=1 


(3.12) 
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Thus,  hk  is  the  posterior  probability  of  expert  k,  given  both  the  input  and  output. 

For  the  M step,  (3.1 1)  is  maximized  or  equivalently,  the  negative  log-likelihood 


N K 


c = £ X \-hk(n)log[gk(n)]  + -hk(n)[(d-yk)T'Lk\d-yk)  + \og\Ll\]  l (3.13) 


n = \k=  1 


is  globally  minimized  over  the  free  parameters.  The  process  is  then  repeated.  If,  in  the  M 
step,  (3.13)  is  only  decreased  and  not  minimized,  then  the  process  is  called  the  General- 
ized EM  (GEM)  algorithm.  This  is  necessary  when  either  the  experts  or  gate  is  non-linear, 
and  a search  for  the  global  minimum  is  impractical. 

The  first  term  in  the  summation  of  (3.13)  can  be  regarded  as  the  cross  entropy  between 
the  posterior  probabilities  and  the  gate.  It  has  a minimum  when  only  one  expert  is  valid, 
and  thus  encourages  the  experts  to  divide  up  the  input  space.  In  order  to  ensure  that  the 
outputs  of  the  gate  sum  to  unity,  the  output  layer  has  a “softmax”  transfer  function 


exp  [sk] 
Sk  k 

X exP[5;] 

j=  l 


(3.14) 


where  sk  is  the  kth  input  to  the  softmax.  For  a gate  implemented  as  a multilayer  perceptron, 
the  cross  entropy  term  in  (3.13)  cannot  be  minimized  in  a single  step  and  the  GEM  algo- 
rithm must  be  employed.  If  the  gate  is  trained  through  gradient  descent  (backpropagation), 
the  error  backpropagated  to  the  input  side  of  the  softmax,  from  (3.13)  and  (3.14),  is 

N 

||  =!><»)-*»(»).  (3-15) 

n = 1 
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This  is  the  same  equation  that  would  result  from  a mean  square  error  criteria  if  hk  is  inter- 
preted as  the  desired  signal  for  the  output,  gk,  of  a trainable  network.  Thus,  the  posterior 
probabilities  act  as  targets  for  the  gate. 

As  an  aside,  in  Xu’s  (1995)  algorithm,  the  gate  for  the  next  iteration  is  the  average  of 
the  posterior  probabilities  for  the  prior  iteration 

N 

Sk  = J^'Zh^-  (3-16) 

n = 1 

Thus,  the  gating  is  not  adaptable  but  is  calculated  and  thus  is  output  based. 

There  is  an  analytical  solution  for  the  experts  at  the  end  of  each  iteration  when  they  are 
linear  regressors,  yk  = Wjpc , namely: 

Wk  = Pfi,'  (3.17) 


N 


N 


X hk(n)xk(ri)xk(n) 


X hk(n)dk(n)xk(n) 


n — n — 1 

Kk 


N 

X ^(») 


p = n - 

rk 


N 

X hM 


(3.18) 


n = 1 n = 1 

Likewise,  there  is  an  exact  solution  for  the  covariance  parameter  lk  at  the  end  of  each 
iteration 

N 


v n — 1 

Lk 


N 

X hk (") 


(3.19) 


n = 1 
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3.5.2  Training  with  Gradient  Descent 

When  the  experts  are  non-linear,  or  on-line  learning  is  desired,  the  experts  can  be 

th 

trained  through  gradient  descent  on  some  weight,  wk,  in  the  k expert 


dC 

dwk 


g^^x' k) 

~k 


” = 1 Z 8jP(d\x>  k) 


j=  1 

All  other  equations  are  unchanged. 


(3.20) 


CHAPTER  4 

COMPETITIVE  PRINCIPAL  COMPONENT  ANALYSIS 
4.1  Introduction 

As  mentioned  in  the  first  chapter,  any  supervised  algorithm  can  be  converted  to  an 
unsupervised  algorithm  by  making  the  desired  signal  a fixed  transformation  of  the  input. 
In  this  manner,  Weigend  (1995a)  has  shown  that  the  normally  supervised  Mixture  of 
Experts  algorithm  is  capable  of  performing  unsupervised  segmentation  and  modeling  of  a 
time  series  when  the  experts  are  predictors  of  the  original  input  one  time  step  into  the 
future.  In  this  chapter,  we  seek  an  analogous  development  when  the  experts  are  linear 
auto-associators  that  reconstruct  the  input  from  a reduced  dimensional  space.  In  other 
words,  we  seek  to  combine  principal  component  analysis  (PCA)  with  the  Mixture  of 
Experts  formalism.  We  first  give  a brief  introduction  to  the  terminology  and  notation  and 
main  results  of  PCA,  without  proof. 

4.2  Principal  Component  Analysis 

Consider  expanding  the  random  vector  x(n),  of  dimension  D,  in  terms  of  some  fixed 
but  complete  basis  in  the  D dimensional  space 

D- 1 

*0 ) = Yj  zWui  = Uz (”)  (4-l) 

i = 0 

where  the  matrix  U = ^/0  Mj  ...  uD  _ ,J  is  deterministic  and  full  rank.  If  the  basis  is  con- 
strained to  be  orthonormal 
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u]Uj  = 5 y (4.2) 

multiplying  both  sides  of  (4.1)  from  the  left  by  UT  immediately  yields  the  coefficients  of 
the  expansion 

z,(«)  = ujx(n)  =>  z(n)  = UTx(n).  (4.3) 

If  the  expansion  is  performed  over  a subset  of  the  basis  functions,  the  result  is  an  esti- 
mate of  x 


p- 1 

x(n)  = £ zi(n)ui  = UpZp(n)  P<D  (4.4) 

« = o 

where  Up  is  a matrix  formed  from  a subset  of  the  columns  of  U.  Likewise,  the  subset  of  the 
expansion  coefficients  are  given  by 

Zp(n)  = UpX(n) . (4.5) 

Given  a set  of  random  vectors,  x(n),  n=\  ...N,  what  is  the  basis  that  minimizes  the  mean 
square  reconstruction  error  over  the  data  set,  £[||jr— jc||"]  , under  the  same  orthonormality 
constraints  as  in  (4.2)?  It  is  not  difficult  to  show  that  the  basis  must  satisfy  the  eigenvalue 
equation 

Rut  = Xiui  (4.6) 

where  and  m,  are  the  eigenvalues  and  eigenvectors,  respectively,  of  the  autocorrelation 
T 

matrix  R = E[xx  ] . The  P eigenvectors  are  chosen  on  the  basis  of  the  P largest  corre- 
sponding eigenvalues.  The  zi  are  called  the  principal  components.  The  minimum  mean 
square  reconstruction  error  is  given  by  the  sum  of  the  discarded  eigenvalues 
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D—  1 

M/w[£'[||x-x||~]]  = Y,  Xr  (4.7) 

i = P 

4.3  Competitive  PCA  Using  the  Mixture  of  Experts 
In  our  model,  the  experts  are  linear  PCA  networks.  These  experts  compete  for  the 
same  data  on  the  basis  of  their  performance.  A natural  measure  of  PCA  performance  is 
reconstruction  error,  which  becomes  the  competitive  mechanism.  By  analogy  with  auto- 
association, we  propose  to  utilize  the  input  itself  as  the  desired  signal,  and  the  recon- 
structed version  of  the  input  as  the  output  of  each  expert,  so  that  (3.3)  becomes 

K 

c = £g*/(*-**)  (4-8) 

k=  1 

where  xk  is  the  reconstructed  input  by  the  kth  PCA  expert.  As  previously  discussed,  there 
are  several  frameworks  applicable  to  competitive  PCA  that  can  be  described  by  (4.8). 
Dony  and  Haykin  (1995)  used  hard  competition  with  multiple  PCA  experts  for  image 
compression  and  segmentation.  However,  hard  competition  is  extremely  sensitive  to  initial 
conditions.  We  will  utilize  the  non-linear  gated  experts,  where  the  gate  is  a non-linear 
function  of  the  input  and  f(-)  is  an  appropriate  pdf. 

We  now  bring  together  the  Mixture  of  Experts  architecture  and  PCA.  The  basic  archi- 
tecture is  shown  in  Figure  4-1,  where  each  expert  is  a linear  PCA  network.  Each  expert 
has  two  outputs:  the  principal  components,  zk,  which  we  call  the  expert  system  output,  and 
the  reconstructed  input,  xk , which  we  call  the  expert  training  output.  It  is  rather  important 
to  understand  why  this  choice  was  made.  In  PCA  the  goal  is  to  either  utilize  the  principal 
components  or  the  eigenvectors  themselves.  But  it  is  not  the  principal  components  that 
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drive  the  competition;  rather,  it  is  the  reconstructions  of  the  input  by  the  experts.  This  is 
the  reason  the  block  diagram  of  Figure  4-1  differs  from  the  more  conventional  MOE  net- 
works which  display  a single  output.  The  gating  outputs,  gk,  weigh  both  the  system  out- 
puts, zk,  and  the  training  outputs,  xk . The  total  system  and  training  outputs  are  thus  given 
by 

K K 

z = £ gk(x)zk(x)  X = ]T  gk(x)xk(x) . (4.9) 

k= 1 k=  1 

The  gate  is  a multilayer  perceptron  which  receives  its  inputs  from  the  same  delay  line  as 
the  PCA  experts. 

Our  first  thought  was  to  use  the  full  set  of  PCA  components  to  effect  the  reconstruc- 
tion. However,  each  PCA  expert  is  capable  of  perfectly  reconstructing  the  input  (the  PCA 
transformation  is  full  rank),  so  there  is  no  information  to  drive  the  competition.  In  order  to 
create  a difference  in  performance  between  the  PCA  experts,  to  drive  the  adaptation  of  the 
gate,  one  has  to  perform  the  reconstruction  with  fewer  than  the  total  number  of  eigenvec- 
tors. That  is,  the  reconstruction  has  to  be  done  from  a subspace.  However,  one  can  still  uti- 
lize the  full  set  of  eigenvectors  or  principal  components  as  the  overall  system  output,  as 
may  be  required  by  PCA. 

4.4  The  PDF  of  Reconstruction  Error 

For  the  MOE,  it  is  necessary  to  specify  f(x—xk)  = p(d\x,  k),  the  conditional  pdf  of 
the  desired  signal,  given  the  input  and  the  parameters  of  the  kth  expert.  For  regression, 
p(d\x,  k ) is  usually  modeled  as  a Gaussian  distribution  in  the  error  between  the  desired 
signal  and  the  kth  experts’  output,  which  for  auto-association  becomes 
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Figure  4-1 . Competitive  PC  A network  implemented  using  the  Mixture  of  Experts  architec- 
ture. 


p(d\x,k)  = 


-I 


\D/2lv 

(2  n)  Z, 


,1/2 


exp 


--(*-**)  Z*  (. x-Xk ) 


(4.10) 


where  Z^  is  the  weighted  covariance  of  the  reconstruction  error  of  the  kth  expert.  Equation 
(4. 1 0)  is  perfectly  valid  when  the  experts  are  non-linear  auto-associators.  However,  as  we 
will  now  show,  when  the  experts  are  linear,  Z^  is  not  full  rank  and  thus  not  invertible.  Con- 
sider a single  PC  A system  with  eigenvectors  U = \lJp  t/J  , where  the  columns  of  U have 
been  separated  into  the  principal  and  secondary  eigenvectors,  respectively.  Noting  that 
UUT  = UpUp  + Uslfs  - /,  the  reconstruction  error  of  a single  input  can  be  written  as 


X(n)-Hn)  = x(n)  — UpUpx(n)  = [I - V pVTp]x(n)  = UsU*x(n)  (4.11) 

and  the  covariance  of  the  reconstruction  error  is  given  by 


Z = £[(*-i)(*-i)r]  = UsVTsE[xxTWsVTs  = V,v]RV,vl  = 


D—  1 

X (4  |2) 


i = P 
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Thus,  the  rank  of  the  covariance  matrix  is  given  by  the  number  of  secondary  eigenvec- 
tors, S = D — P.  Since  the  covariance  matrix  is  square  and  of  dimension  D,  it  is  not  full 
rank  for  any  dimensionality  reduction.  Ultimately,  all  such  direct  attempts  at  formulating 
PCA  in  a statistical  framework  fail  because  PCA  is  not  a model  for  the  data  but  merely  a 
transformation  of  the  data  based  on  a decomposition  of  the  covariance  matrix.  This  is  an 
undesirable  situation  because  we  would  like  to  use  the  same  statistical  framework  for  both 
linear  and  non-linear  auto-association.  One  way  around  this  is  to  borrow  some  results  from 
factor  analysis.  Factor  analysis  is  a model  that  attempts  to  explain  an  observed  high  dimen- 
sional random  vector  in  terms  of  an  unobserved  lower  dimensional  random  vector.  We 
now  show  that,  under  certain  simplifying  assumptions,  factor  analysis  defaults  to  a PCA 
expansion. 

Here  we  examine  the  conditions  under  which  factor  analysis  reduces  to  principal  com- 
ponent analysis,  for  the  purpose  of  finding  a statistical  framework  for  PCA.  According  to 
Morrison  (1976),  this  special  factor  model  was  studied  by  Lawley  (1953).  Here  we  present 
an  original  derivation.  We  assume  that  an  observed  Gaussian  random  vector, 
x ~ Nd{ 0,  Z),  can  be  explained  in  terms  of  a postulated  but  unobserved  lower  dimen- 
sional Gaussian  random  vector,  z ~ Np(0,  I) , such  that 

jc  («)  = Az(n)  + e(n)  (4.13) 

where  A is  a D x P matrix,  D>P,  and  e is  independent  Gaussian  noise,  e ~ Np(  0,  VP), 
with  T*  assumed  to  be  diagonal.  Given  these  assumptions,  then  x is  also  Gaussian  distrib- 
uted, x ~ Nd(0,  Z) , with  a population  covariance  given  by 

Z = E[xxT]  = AAT+'¥. 


(4.14) 
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Note  the  similarity  between  (4.13)  and  the  PCA  expansion  equation  (4.4).  However, 
(4.13)  is  a model  for  the  observed  data  x in  terms  of  the  latent  variable  z,  while  in  (4.4),  z 
is  simply  the  principal  components.  The  goal  of  factor  analysis  is  to  estimate  A and  S. 
This  can  be  done  by  maximizing  the  likelihood  of  the  sample  population  of  x under  the 
assumed  model  for  the  covariance.  For  a sample  data  set  of  A vectors,  the  resulting  likeli- 
hood of  the  observed  samples  is 

N N 


L = = 


D/ 2|v|l/2 
(271)  |Z| 


Exp 


--xT(n)I  '*(«) 


(4.15) 


and  the  negative  log-likelihood,  which  is  our  criteria  to  be  minimized,  can  be  written 

C = - Log[L ] = ^Log[  |I|]  + ^Tr  [/?!”']  + Constant  (4.16) 


where  R is  the  sample  covariance  of  x 

N 

R = 'Yj  •*(«)* r(«)-  (4-17) 

n = 1 

The  maximum  likelihood  solution  for  A and  I is  found  by  taking  the  partial  derivative 
of  (4. 1 6)  with  respect  to  those  parameters,  and  setting  the  results  equal  to  zero.  However, 
this  can  only  find  the  space  spanned  by  A,  since  any  orthogonal  rotation  of  z still  leads  to 
the  same  value  of  I.  An  additional  constraint  is  needed  to  find  a unique  solution  for  A. 
Here,  we  impose  two  constraints  that  are  much  stricter  than  those  normally  imposed  in 
factor  analysis:  namely,  that  the  columns  of  A = [a,  ...ap]  are  orthogonal,  and  that  the 
components  of  the  noise  have  equal  variance 


(4.18) 
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¥ = a2I.  (4.19) 

With  these  two  constraints,  factor  analysis  reduces  to  PCA.  We  will  now  outline  a 
proof  of  this.  Although  we  could  use  the  calculus  of  variations  and  append  the  constraints 
to  the  criteria,  it  is  more  instructive  to  use  them  explicitly.  The  postulated  population  cova- 
riance can  be  written 

p 

I = c2I  + Y aia]  ■ (4.20) 

i=  1 

The  first  P eigenvalues  of  S can  be  readily  found  by  multiplying  (4.20)  by  cij 
p 

Idj  - <y2aj+  Y ai(aIaj)  = ° 2(lj  + \ajtaj  = (°2  + KH2)fly  1 -7  - ^ • (4-21) 
/ = 1 

Simple  examination  of  (4.20)  shows  that  the  remaining  S = D-P  secondary  eigenvalues 
are  all  given  by  a2.  Since  the  determinant  of  a matrix  is  equal  to  the  product  of  its  eigen- 
values, the  determinant  and  log  determinant  of  I are  given  by 

p 

in  = oMn(“2+N2)  <4-22> 

i = 1 

P 

log|X|  = SlogCT2  + Y log[°2  + HI2]  • (4-23) 

/•  = 1 

The  constraints  in  (4.18)  and  (4.19)  allow  us  to  simplify  a well  known  matrix  inver- 
sion theorem 

-1  _ I A 

2 2 

a a 


1 + 


T ‘ 

A A 


-y 


>-z 


a a 

ii 


G~  + lla.ir 


(4.24) 
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Using  (4.23)  and  (4.24),  we  can  re-write  the  criteria  of  (4.16),  after  absorbing  con- 


stants, as 


C = SlogcT  + Y *°g[a~  + IKin  + ~2 

o 

i=  1 


Tr  [/?«  a7] 
Tr[/f]  — Y ~~2 


i=,G  +ifl«l 


(4.25) 


Note  that  direct  application  of  the  constraints  has  completely  decoupled  the  columns  of  A 
in  the  criteria.  Minimizing  (4.25)  with  respect  to  ak  results  in  the  equation 


(akRak)ak 


Ra, 


a + \\a 


ii 2 2 2 „ „2  2 2.2  m m2. 

o2(ct + N ) CT(a+WI) 


= 0 


(4.26) 


which  can  only  be  true  in  general  if  ak  is  an  eigenvector  of  R 

Rak  = 1 <k<P. 


(4.27) 


We  have  thus  shown  that,  under  the  assumptions  of  (4. 1 8)  and  (4. 1 9),  factor  analysis  is 
equivalent  to  a PCA  expansion.  Substituting  (4.27)  back  into  (4.26)  allows  us  to  solve  for 
the  magnitudes  of  the  eigenvectors 


* II2  _ , 2 

Ufa  ||  — Kj^  (T 


1 <k<P. 


(4.28) 


Substituting  (4.28)  into  (4.25),  and  making  use  of  the  theorem  that  the  trace  of  a matrix  is 
equal  to  the  sum  of  its  eigenvalues,  results,  after  simplification,  in  the  reduced  criterion 


p D 

C = Sloga2  + ^logA.  + P + l Y \ 

a 

i = 1 i = P + 1 


(4.29) 


Minimizing  (4.29)  with  respect  to  ct2  immediately  yields 


D 


-2 

a 


“ - I i, 


(4.30) 


i = P + I 
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which  is  the  average  value  of  the  secondary  eigenvalues.  We  have  now  fully  determined 
the  maximum  likelihood  solution  for  the  free  parameters  of  the  model,  A and  a . We  can 
gain  further  insight  into  the  solution  by  substituting  (4.28)  and  (4.30)  into  the  assumed 
model  (4.20)  for  the  covariance.  If  we  represent  the  columns  of  .4  in  terms  of  their  magni- 
tude and  a unit  direction,  ai  = ||aj  • ui,  where  the  m,  are  the  normalized  eigenvectors  of 
R,  then  the  model  covariance  and  its  inverse  evaluated  at  the  maximum  likelihood  solution 
can  be  written 


I = 


P D 

£^.(K,.|/f)  + G2  £ («,«[) 

i=l  i = P+\ 


(4.31) 


P D 

1 ' = + : 2 Z 

/ = i 1 ° i = p+ 1 


(4.32) 


Both  (4.31)  and  (4.32)  are  very  similar  to  the  spectral  decomposition  of  R and  its 

inverse,  respectively,  except  the  smallest  S eigenvalues  have  all  been  replaced  by  their 

* 2 

average,  o~.  Note  that  while  (4.31)  is  dominated  by  the  principal  eigenvalues,  (4.32)  is 
dominated  by  the  secondary  eigenvalues,  and  thus  we  can  approximate  the  maximum  like- 
lihood solution  for  the  inverse  covariance  by  the  second  term  only 


D 

V_l  1 V"  T - 1 rr  nT 

1 *72  Z “<"«  - 72 UsUs- 

CT  i = P+\  CT 


The  Mahalanobis  distance  for  a given  input  vector  x is  then  given  by 


T 7 1 1 Tit  ttT 

XL  X « —X  UsU'X. 

a 


(4.33) 


(4.34) 


However,  from  (4.1 1),  the  magnitude  squared  of  the  PCA  reconstruction  error  is  given  by 
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n 2 T Trr  ttT TT  jtT  Tjt  ttT 

Vj  = epcaepca  = x UsUsUsUsX  = x UsUsx  ■ 


(4.35) 


Substituting  (4.35)  into  (4.34),  we  can  write 


r"-’ 

XL  X ■ 


' pea 


• 2 

a 


(4.36) 


Returning  now  to  the  case  of  multiple  experts,  (4.36)  suggests  that  we  can  write  the 
pdf  of  the  reconstruction  error  of  the  kth  experts  as 


p(d\x,  k ) 


1 


2 S/2 
(2  nak) 


exp 


2a 


(4.37) 


k 7 


where  the  variance  is  given  by  the  mean  value  of  the  secondary  eigenvalues 


D—  1 

= 5 I V,  = S^lIMJI2]  (4-38) 

i = P 

and  Xfo  is  the  ith  eigenvalue  of  the  kth  expert.  Note  that  this  is  the  same  as  if  we  set  the 
covariance  of  the  reconstruction  error  in  (4.11)  to  have  the  diagonal  form 

(4.39) 

and  assume  S degrees  of  freedom.  Thus,  using  (4.39),  we  can  use  the  same  reconstruction 
error  model  independent  of  whether  the  experts  are  linear  or  non-linear.  It  is  natural  then 
to  question  the  consequences  of  not  using  the  exact  pdf  in  (4.37).  First,  even  the  exact  for- 
mulation makes  a Gaussian  assumption,  which  may  be  violated  by  the  data  itself.  Second, 
any  approximation  in  the  conditional  pdf’s  of  the  experts  can  be  accommodated  by  a small 
change  in  the  mixture  coefficients.  Third,  as  we  shall  soon  see,  the  approximation  leads  to 
a weighted  mean  square  error  derivation  of  the  linear  PCA  parameters,  similar  to  the  deri- 
vation for  a single  PCA  system.  Finally,  the  model  works  sufficiently  well  in  practice. 
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4.5  Solution  of  Competitive  PCA  Decomposition 
Having  found  the  appropriate  pdf  for  PCA,  the  negative  log-likelihood  cost  function  is 


N K 


c=  X X } ‘ M'OtogfejtOO]  + 


rt  = \k=  1 


V ; + Slog  a;  + Slog  2 71 


•.(4.40) 


We  will  now  show  that  the  competitive  PCA  network  will  find  the  principal  compo- 
nents of  each  stationary  segment  of  the  input  signal.  Since  PCA  is  a linear  transformation, 
we  can  globally  minimize  the  second  part  of  (4.40)  with  respect  to  the  experts’  weights. 
Since  the  experts  have  been  decoupled  in  (4.40),  we  can  minimize  that  part  of  the  cost 
function  with  respect  to  each  expert  separately 


N 


ck  = \ X hk(n)\\x(n)-xk(n)\\2 . 


(4.41) 


n = 1 


Let  the  weights  of  the  kth  expert  be  given  by  some  unknown  P x D matrix  Wk  so  that 
the  reconstruction  error  is  given  by 


x{n)-xM  = (. l-WkW‘k)x{n)  + bk 


(4.42) 


where  we  have  included  the  unknown  constant  vector  bk  to  allow  for  the  possibility  that  x 

iL 

is  not  zero-mean.  Then  the  cost  function  of  the  k expert  is 


Cy  = 


\ X hk{n){[x\n){I-Wkwl)  + bl][{I-Wkwl)x{n)  + bk}} 


n = 1 
N 


(4.43) 


1 


= 5 X hk(n){x\nXI-WkWl)x(n)  + 2X\n)(I-fVktVJk)bk  + bIkbk} 


where  the  orthogonality  condition,  Wk  Wk  = /,  allowed  the  simplification 
(/-  WkWTk)  = (/-  WkWTk) . Finding  the  constant  bk  first: 
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aT  = o =>  bk  = -(/-  pfX)^ 


(4.44) 


/v 


X hk(n)x(n) 


Z 


(4.45) 


We  can  now  re-write  the  cost  function  as 

N 

ck  = \Y. 

n = 1 

= Constant  • ^Tr[(/-  WkWTk)Rk] 


(4.46) 


where  /?£  is  a weighted  autocorrelation  matrix 

N 

Z hk(n)l-x(n)-Xk][x(r*)-Xk]T 

**■— AT • (4-47) 

Z 

n = 1 
• th 

Let  the  i column  of  be  given  by  wki.  Since  the  columns  of  Wk  are  already  decou- 
pled in  (4.46),  we  need  only  append  a normality  constraint  to  the  cost  function,  which 
becomes,  after  absorbing  constants 

p P 

ck  = \^[Rk\-\ z TriwkiwkiRk\ + \ Z Mvr1)'  (4-48) 

/ = i i = i 


The  minimum  of  (4.48)  occurs  when  the  wkj-  are  chosen  to  be  eigenvectors  of  Rk 
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8Ck 

— = ° =>  Rkwkj  = Xkjwkj 
kj 


(4.49) 


but  chosen  so  that  the  eigenvectors  correspond  to  the  largest  eigenvalues.  That  is  to  say,  for 
the  optimal  reconstruction  of  the  kth  expert  at  each  M step,  we  should  choose  the  P largest 
eigenvectors  of  Rk.  This  clearly  shows  that  the  competitive  PCA  architecture  is  still  solv- 
ing an  eigenvalue  problem.  Furthermore,  if  the  posterior  probabilities  converge  to  binary 
values  indicating  the  stationary  regions  of  the  time  series,  then  each  experts’  autocorrela- 
tion matrix  and  eigenvectors  will  correspond  to  a different  stationary  regime. 

Finally,  at  each  iteration,  the  variance  of  each  experts’  pdf  is  set  to  the  average  of  the 
discarded  eigenvalues,  according  to  (4.38).  We  now  summarize  the  algorithm  in  pseudo- 
code. 


For  each  iteration,  i: 

E-Step: 

For  each  vector  in  the  data  set,  n = 1...N 
1.  Calculate  the  gate  outputs:  gk(x(n)) . 

For  each  expert,  k = 1...K 

T T - 

2.  Calculate  the  reconstruction  estimate:  xk(n)  = tV.tr.x(n)  + (/-  Wk  Wk  )xk . 


3.  Evaluate  the  pdf  of  each  expert:  p(d(n)\x(n),  k) 


2 S/2 
(2nok) 


exp 


2-1 


2oT 


gk(n)p(d(n)\x(n),k) 


4.  Calculate  the  posterior  probabilities:  hk(n)  = — 

Y Sj(.n)p(.<t(n)\x(n),  k) 
j=  1 

M-Step: 

5.  Train  the  gate  using  the  posterior  probabilities,  h/Jn),  as  targets  at  the  input  side  of  the  soft-max  out- 
put layer. 

For  each  expert,  k = J...K 


6.  Calculate  the  conditional  means:  xk  = 


N 

X hk(n) 

-n  - 1 

N 


-H  V 

Y hk(n)x(n). 


n = 1 


7.  Calculate  the  autocorrelation  matrices:  RL 


-i-i  N 


Y hk(n) 

-n  = 1 


Y hk(nAx(n)~xk][x(n)-xk]r . 


n = 1 
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8.  Calculate  the  eigenvectors,  j = Rk*vkJ  = A kjwkj ■ 

D—  1 

2 J 

9.  Calculate  the  variance:  ak  = - ^ 

i = P 

The  overall  output,  as  defined  by  (4.9),  can  be  calculated  after  the  algorithm  has  con- 
verged. 


4.6  Practical  Implementation  Issues 

4.6.1  Applicability 

We  envision  three  primary  applications  of  competitive  Temporal  PC  A:  signal  segmen- 
tation, noise  reduction,  and  adaptive  encoding. 

4.6.2  Network  design 

In  designing  a system,  the  following  parameters  must  be  determined:  The  number  of 
experts,  the  number  of  principal  components  per  expert,  and  the  number  of  processing  ele- 
ments of  the  gating  network. 

The  number  of  experts  should  be  chosen  so  as  to  match  the  number  of  stationary 
regimes.  In  some  cases,  such  as  with  speech,  the  number  of  stationary  regimes  can  be  esti- 
mated fairly  well  beforehand.  In  most  cases,  however,  the  number  is  unknown.  In  these 
cases,  pruning  or  growing  algorithms  can  be  employed,  but  are  beyond  the  scope  of  this 
paper. 

The  number  of  principal  components  required  per  expert  should  be  chosen  on  the  basis 
of  the  number  of  experts.  With  just  a few  experts,  the  first  two  or  four  principal  compo- 
nents will  almost  always  provide  sufficient  differentiation  between  regimes.  Large  num- 
bers of  experts  may  require  more  principal  components. 

Spurious  switching  can  be  influenced  by  the  size  of  the  gating  network.  Recall  that  the 
posterior  probabilities  act  as  a target  for  the  gate.  If  the  gate  has  too  many  free  parameters, 
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it  will  tend  to  overfit  the  posterior  probabilities,  which  can  exhibit  spurious  switching 
errors  due  to  noise.  In  this  case,  the  gate  will  simply  learn  the  switching  errors.  On  the 
other  hand,  a gating  network  with  a minimal  number  of  free  parameters  is  less  likely  to 
memorize  spurious  switching  errors,  and  thus  is  more  likely  to  provide  a smoother  seg- 
mentation. Thus,  it  is  important  to  find  the  gating  network  with  the  minimal  number  of 
free  weights  that  can  still  learn  the  posterior  targets  reasonably  well. 

4.6.3  Initialization 

The  weights  of  the  gating  network  should  be  initialized  to  small  values  so  that  the  gate 
estimates  all  experts  as  being  nearly  equiprobable  over  the  entire  data  set.  The  expert  vari- 
ances should  be  initialized  to  large  values,  to  express  uncertainty  in  the  initial  weights  of 
the  PCA  experts. 

There  are  several  different  ways  to  initialize  the  PCA  experts.  From  the  global  KLT  for 
the  entire  data  set,  the  global  eigenfilters  can  be  calculated.  In  one  scenario,  all  the  expert 
eigenfilters  can  be  made  equal  to  the  global  eigenfilters,  plus  small  random  perturbations. 
In  practice,  this  is  equivalent  to  a completely  random  initialization.  This  is  because  early 
in  the  training,  the  gate  estimates  all  experts  as  being  equally  probable,  and  since  the 
expert  variances  are  initially  large,  the  experts’  posterior  probabilities  are  nearly  identical 
over  the  entire  data  set.  Thus,  at  least  after  the  first  iteration  through  the  data  set,  the 
experts  approximate  the  global  eigenfilters.  Another  initialization  technique,  which  we 
prefer,  is  to  assign  the  1 st  and  2nd  global  eigenfilters  to  the  first  expert,  the  3rd  and  4th  glo- 
bal eigenfilters  to  the  second  expert,  etc.  The  remaining  eigenfilters  of  each  expert  are  then 
initialized  to  normalized  random  vectors. 
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4.6.4  Training  Issues 

In  our  experience,  the  pdf’s  dominate  the  calculation  of  the  posterior  probabilities.  The 
gate’s  contribution  to  the  posterior  probabilities  during  training  is  primarily  to  reinforce 
the  pdf’s,  by  learning  the  posterior  probabilities,  and  to  interpolate  at  the  boundaries 
between  regimes.  Therefore,  at  a particular  training  iteration,  it  is  not  necessary  to  train  the 
gate  to  full  parameter  convergence. 

For  some  harder  problems,  we  find  that  annealing  the  number  of  principal  compo- 
nents, starting  from  P-2  and  progressing  up  to  the  final  dimensionality  reduction,  yields 
better  results. 

4.6.5  Repeatability  and  convergence 

We  wish  to  emphasize  that  not  all  training  runs  converge  to  a reasonable  solution. 
However,  degenerate  solutions  are  usually  quite  obvious,  often  resulting  in  a very  rapid 
switching  between  experts  every  few  samples,  with  no  visible  segmentation.  Among  runs 
that  converge,  we  find  that  they  produce  qualitatively  similar  segmentations.  However, 
there  can  be  differences  in  the  exact  location  of  major  regime  transitions  and  in  the  amount 
of  spurious  switching.  As  with  all  complex  adaptive  systems,  training  must  be  repeated 
several  times  and  the  results  compared.  However,  since  the  algorithm  is  unsupervised,  care 
must  be  exercised  in  choosing  the  best  training. 

4.7  Extensions 

4.7.1  Minimum  eigenvalue  approach 

Pisarenko’s  (1973)  harmonic  retrieval  (PHR)  method  is  an  eigenanalysis  technique  for 
the  special  case  when  a time  series  consists  of  superimposed  sinusoids  in  additive  noise.  It 
can  be  shown  that,  for  a given  eigenvector  w,-  with  associated  eigenvalue,  Xh  all  eigenvec- 
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tors  uj  with  associated  eigenvalue  Xj,  where  j>i,  will  have  a zero  at  the  location  of  the  peak 
of  the  spectrum  of  In  practice,  this  also  holds  true  for  more  complex  signals.  Thus, 
because  of  the  orthogonality  constraint,  the  eigenvector  associated  with  the  smallest  eigen- 
value essentially  contains  information  about  all  the  other  eigenvectors.  The  MUSIC  algo- 
rithm exploits  this  by  looking  at  the  inverse  spectrum  of  the  eigenfilters  in  the  noise 
subspace  (those  with  the  smallest  eigenvalues). 

Inspired  by  Pisarenko’s  work,  instead  of  using  the  experts’  reconstruction  error  as  the 
competitive  basis,  we  seek  a cost  function  that  is  minimum  for  an  orthogonal  projection 
between  the  input  and  a weight  vector.  Thus,  if  the  output  of  a single  expert  is 

y(n ) = n>Tx(n ) (4.50) 

then  y2  would  be  the  competitive  metric.  In  the  general  case,  minimizing  v“  over  the  entire 
data  set,  subject  to  the  constraint  w w = 1 , is  equivalent  to  finding  the  eigenfilter  associ- 
ated with  the  smallest  eigenvalue 

Min„{£[y2]}  = MmJE[wTRw]}=>RW  = Xminw.  (4.51) 

While  the  maximum  eigenfilter  can  be  regarded  as  a matched  bandpass  filter,  the  min- 
imal eigenfilter  can  be  regarded  as  an  inverse  filter.  The  minimal  eigenfilter  can  also  be 
found  adaptively  on-line  through  anti-Hebbian  learning.  In  all  cases,  integration  with  the 
Mixture  of  Experts  formalism  is  straightforward. 

4.7.2  Modeling  Power  Variations  Within  a Regime 

It  is  often  desirable  to  recognize  a particular  stationary  regime  independent  of  its 
power.  Although  matched  eigenvectors  are  invariant  to  changes  in  the  power  of  a signal, 
the  eigenvalues  are  not.  Since  the  variance  is  equal  to  the  mean  of  the  discarded  eigenval- 
ues, power  variations  can  affect  the  expert  pdf’s  and  posterior  probabilities.  A single  vari- 
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ance  estimate  for  each  of  the  experts  may  be  inadequate  in  this  case.  An  alternative 
approach  is  to  model  the  variance  as  a function  of  the  input:  ak  = ak(x) . This  can  be 
accomplished  with  a neural  network  that  learns  to  predict  the  variance  on  a sample  by 
sample  basis  (Bishop,  1995). 

4.7.3  Principal  Sub-space  Decomposition 

One  extension  that  has  already  been  mentioned  is  to  use  a principal  sub-space  decom- 
position, instead  of  a principal  component  decomposition.  This  may  speed  up  on-line 
learning,  since  it  is  no  longer  necessary  that  the  eigenvectors  converge  in  sequence,  as  is 
required  for  the  on-line  PCA  algorithms.  Principal  sub-space  decomposition  also  allows 
the  use  of  non-linear  expert  networks. 

4.7.4  Reducing  Spurious  Gate  Switching 

If  signal  segmentation  is  the  primary  goal,  then  it  is  desirable  to  remove  spurious  gate 
switching.  One  way  is  to  accomplish  this  is  to  low  pass  “filter”  the  posterior  probabilities, 
by  computing  the  joint  posterior  probabilities  for  neighboring  time  steps,  and  then  use 
these  as  the  target  for  the  gate.  A second  way  to  remove  spurious  gate  switching  is  to  add 
some  form  of  memory  to  the  gating  network.  In  the  context  of  competitive  prediction, 
Weigend  (1995a)  used  both  global  feedback  of  the  gate  output  to  the  input,  as  well  as  an 
exponential  memory  on  the  hidden  units  of  the  gating  network,  and  found  that  the  latter 
approach  worked  better. 

4.7.5  On-line  learning 

When  gradient  descent  on  the  global  cost  function  (4.40)  is  used  for  adaptation,  all  the 
free  parameters  of  the  competitive  PCA  architecture  can  be  updated  on  a sample  by  sam- 
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pie  basis.  Sanger’s  rule  computes  the  principal  components  in  an  on-line  approach.  The 
matrix  form  of  Sanger’s  rule 

A fV(n)  oc y(n)xT  — LT[y(n)yT (n)]  W(n)  (4.52) 

in  the  MOE  framework  becomes 

AWk(n)Kh^{yk(n)xT-LT\yk(n)yTk(n)]Wk(n))  (4.53) 

where  LT  denotes  the  lower  triangular  operator,  which  sets  all  elements  of  the  matrix 
above  the  diagonal  equal  to  zero.  The  learning  rates  of  the  gate  and  of  the  expert  networks 
must  be  carefully  coordinated. 

4.8  Application  to  Images 

4,8.1  Texture  Segmentation 

We  applied  the  model  to  a 504x504  grayscale  image  consisting  of  five  textures,  shown 
in  Figure  4-2.  The  training  set  consisted  of  1,764  non-overlapping  sub-blocks  of  12x12 
pixels.  There  were  five  linear  PC  A experts,  each  performing  a 144  to  16  dimensionality 
reduction  (and  a 16  to  144  expansion  for  reconstruction).  The  gating  network  was  a two 
hidden  layer  multi-layer  perceptron  with  15  and  10  hyperbolic  tangent  activation  functions 
on  the  hidden  layers,  respectively,  resulting  in  an  overall  architecture  of  144-15-10-5. 

The  output  of  the  gate  after  training  is  shown  in  Figure  4-3  and  the  posterior  probabil- 
ities, which  act  as  a target  for  the  gate,  are  shown  in  Figure  4-4.  For  each  figure,  panel  (f) 
represents  the  winner-take-all  segmentation,  grayscale  coded  to  indicate  the  winning 
expert  for  each  sub-block.  From  these,  we  see  that  the  image  has  been  fairly  successfully 
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Figure  4-2.  Training  image. 

segmented,  with  most  of  the  errors  occurring  in  the  south  texture.  Also,  the  gate  has 
learned  the  posterior  probabilities  very  well. 

Figure  4-5  shows  the  masks  (the  PCA  weights  arranged  as  a matrix  and  gray  scale 
coded  as  to  value)  of  the  five  experts.  The  experts  are  ordered  from  top  (expert  1)  to  bot- 
tom (expert  5),  and  the  masks  are  ordered  from  the  largest  (left)  to  smallest  (right)  corre- 
sponding eigenvalues.  Only  the  first  eight  of  the  sixteen  masks  are  shown.  We  see  that 
each  expert  clearly  specialized  on  a particular  feature  of  a specific  texture. 
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Figure  4-3.  Training  image  gate  segmentation:  (a)-(e)  gate  outputs  1-5  corresponding  to 
experts  1-5,  respectively;  (f)  winner  take-all  segmentation. 


Figure  4-4.  Training  image  posterior  probabilities  segmentation:  (a)-(e)  gate  outputs  1-5 
corresponding  to  experts  1-5,  respectively;  (f)  winner  take-all  segmentation. 
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Figure  4-5.  (a)-(f)  First  eight  PCA  masks  for  experts  1-5,  respectively. 

We  tested  the  trained  model  on  the  504x504  grayscale  image  shown  in  Figure  4-6.  It 
consists  of  five  patterns,  two  of  which  were  also  present  in  the  training  image,  and  one  of 
which  is  very  similar  to  a texture  in  the  training  image. 

The  testing  results  for  the  gate  and  posterior  probabilities  are  shown  in  Figure  4-7  and 
Figure  4-8,  respectively.  We  see  that  the  gate  has  done  only  a fair  job  of  segmenting  the 
test  image,  with  expert  1 fully  recognizing  the  movement  of  the  north  texture  in  the  train- 
ing image  to  the  east  texture  in  the  testing  image,  but  only  partially  recognizing  the  move- 
ment of  the  west  texture  in  the  training  image  to  the  north  position  in  the  testing  image. 
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Figure  4-6.  Testing  image. 

However,  the  posterior  probabilities  demonstrate  a much  better  segmentation.  Even  the 
center  textures,  which  are  similar  but  not  completely  equivalent  between  the  training  and 
test  images,  were  recognized  as  belonging  to  the  same  texture  class. 

4.8.2  Real  Image  Segmentation 

We  then  applied  the  model  to  a 560x704  grayscale  image  of  a real  world  scene,  shown 
in  Figure  4-9,  consisting  of  a sloped  residential  street  in  the  foreground  with  rolling  hills 
and  clear  sky  in  the  background.  The  training  set  consisted  of  6,160  non-overlapping  sub- 
blocks of  12x12  pixels.  There  were  eight  linear  PC  A experts,  each  performing  a 64  to  16 
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Figure  4-7.  Test  image  gate  segmentation:  (a-e)  gate  outputs  1-5  corresponding  to  experts 
1-5,  respectively;  (f)  winner  take-all  segmentation. 


Figure  4-8.  Test  image  posterior  probabilities  segmentation:  (a-e)  posterior  probabilities 
corresponding  to  experts  1-5,  respectively;  (f)  winner  take-all  segmentation. 
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Figure  4-9.  Real  world  training  image. 

dimensionality  reduction  (and  a 16  to  64  expansion  for  reconstruction).  The  gating  net- 
work was  a two  hidden  layer  multi-layer  perceptron  with  48  and  32  hyperbolic  tangent 
activation  functions  on  the  hidden  layers,  respectively,  resulting  in  an  overall  architecture 
of  64-48-32-8. 

The  results  after  training  for  the  gate  and  posterior  probabilities  are  shown  in  Figure  4- 
10  and  Figure  4-11,  respectively.  From  the  figures,  we  see  that  the  PCA  experts  have 
begun  to  specialize  on  different  textures  within  the  image.  For  example,  expert  one  (a)  has 
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Figure  4-10.  Real  world  image  gate  segmentation:  (a-h)  gate  outputs  1-8  corresponding  to 
experts  1-8,  respectively;  (i)  winner  take-all  segmentation 

specialized  on  the  distant  fields  near  the  horizon,  expert  four  (d)  on  the  roofs  of  the  houses, 

expert  five  (e)  on  the  sky,  and  expert  six  (f)  on  certain  windows. 

It  is  also  apparent  that  the  gate  has  not  converged  to  the  posterior  probabilities,  in  spite 

of  the  gate  having  almost  4,900  adaptable  weights.  Clearly  this  is  difficult  problem  for  a 

neural  network  to  learn.  Still,  the  gate  provides  a critical  smoothing  function. 
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Figure  4-11.  Real  world  image  posterior  probabilities  segmentation:  (a-h)  gate  outputs  1-8 
corresponding  to  experts  1 -8,  respectively;  (i)  winner  take-all  segmentation 

4.9  Conclusion 

It  is  clear  that  competitive  PCA  is  not  only  doing  data  reduction  but  also  segmentation. 
This  is  an  important  observation  since  conventional  PCA  treats  the  entire  image  as  a 
unique  class,  i.e.  it  is  limited  to  the  representation  of  the  image.  With  our  scheme  each 
PCA  component  can  represent  a subset  of  the  image  patterns.  Competition  is  utilized  to 
find  each  piece,  and  thus  the  method  brings  discrimination  into  PCA  analysis,  which  has 
not  been  done  in  the  past. 


CHAPTER  5 

TEMPORAL  PRINCIPAL  COMPONENT  ANALYSIS 

5.1  Application  of  Competitive  PC  A to  Time  Series 
We  now  consider  the  application  of  competitive  PCA  to  the  case  when  the  system 
input  is  a time  series  fed  through  a tapped  delay  line.  The  architecture  of  each  expert  is 
then  as  shown  in  Figure  5-1,  which  we  call  temporal  PCA. 


Figure  5-1.  Temporal  PCA  architecture:  (a)  PCA  network;  (b)  reconstruction  network. 


In  this  case,  the  autocorrelation  matrix,  R,  of  dimension  N,  is  a Toeplitz  matrix  with 
components  given  by  the  temporal  autocorrelation  function,  Rmn  = r(m  — n).  The  solu- 
tions to  the  eigenvalue  equation,  Ru  = Xu , are  called  eigenfilters,  since  they  act  as  time 
domain  filters  for  the  time  series. 
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5.2  Competitive  PC  A Simulations 

We  now  present  three  simulations  to  demonstrate  the  properties  of  the  algorithm;  one 
with  artificially  created  data,  and  the  others  with  real-world  data. 

5.2.1  Signal  Segmentation:  Switching  FIR  Process 

We  first  tested  the  algorithm  on  a toy  problem  consisting  of  a switching  FIR  process.  A 
time  series  was  generated  by  a Gaussian  noise  driven  FIR  filter  with  20  taps.  Every  100 
samples,  the  filter’s  coefficients  were  switched  between  one  of  two  normalized  weight 
sets.  The  total  time  series  generated  was  400  samples  long.  The  resulting  time  series  is 
shown  in  Figure  5-2. a,  along  with  lines  indicating  the  segmentation.  The  time  series  is  dif- 
ficult to  segment  by  eye.  Depending  on  the  local  properties  of  the  driving  noise,  different 
spectral  peaks  can  be  excited  within  the  same  stationary  region.  There  were  two  experts, 
each  performing  a dimensionality  reduction  of  20:6.  The  gate  was  a TDNN  with  two  hid- 
den layers  and  architecture  20-4-3-2.  The  gate  and  experts  were  trained  for  200  epochs  of 
the  EM  algorithm.  For  each  M step,  the  gate  was  trained  for  10  epochs.  A reasonable  seg- 
mentation was  achieved  after  only  25  EM  iterations,  but  with  additional  training  the  transi- 
tions became  better  defined  and  spurious  gate  switching  errors  were  reduced.  The  results 
after  training,  shown  in  Figure  5-2.,  show  that  the  gate  has  successfully  segmented  the 
switching  FIR  process.  There  is  spurious  switching  near  the  transitions,  and  especially 
during  the  stationary  regime  from  samples  200  to  300.  However,  such  switching  is  brief 
and  represents  a small  fraction  of  the  total  time.  Figure  5-3  clearly  demonstrates  the  ten- 
dency of  the  matched  eigenfilters  to  pick  out  peaks  in  the  signal  spectrum.  Examining  col- 
umn (b),  the  largest  peak  of  FIR  system  II  occurs  at  approximately  0.4  (Nyquist  = 1),  with 
a slightly  smaller  peak  at  0.85.  The  spectra  of  the  first  two  eigenfilters  of  expert  II  are 
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matched  to  the  largest  peak,  effectively  ignoring  the  slightly  smaller  peak.  The  spectra  of 
the  3rd  and  4th  eigenfilters  are  matched  to  the  2nd  largest  peak  only,  and  ignore  the  largest 
peak.  The  same  behavior  is  observed  for  expert  I. 

Figure  5-4  shows  various  convergence  metrics  versus  the  number  of  EM  iterations. 
This  is  important  because  we  mathematically  showed  that  each  PCA  expert  develops  a 
model  for  the  individual  time  series  segment  provided  the  segmentation  is  correct.  In  prac- 
tice, however,  the  segmentation  will  always  be  noisy.  Therefore,  an  important  question  is 
to  find  out  experimentally  how  fast  the  PCA  experts  converge  to  the  true  eigenvectors  as  a 
function  of  the  quality  of  the  segmentation.  A segmentation  quality  metric  for  an  expert 
was  defined  as  the  mean  value  of  either  the  gate  output  or  posterior  probabilities  during 
the  regime  it  was  trying  to  model.  This  metric  has  a value  of  one  for  perfect  segmentation, 
and  0.5  for  a segmentation  chosen  randomly.  Panels  (a)  and  (b)  show  the  evolution  of  the 
segmentation  quality  of  the  experts  during  training,  as  measured  by  the  gate  and  posterior 
probabilities.  Convergence  is  monotonic,  with  the  gate  trailing  the  posterior  probabilities, 
which  have  nearly  converged  by  the  25th  iteration.  This  is  to  be  expected  since  the  poste- 
rior probabilities  act  as  targets  for  the  gate.  Panel  (c)  shows  the  evolution  of  the  angle 
between  each  expert’s  1 st  eigenvector  and  the  true  1 st  eigenvector  for  the  regime  the  expert 
is  trying  to  model.  We  see  that  by  the  25th  iteration,  the  experts’  eigenvectors  have  nearly 
converged  to  the  true  eigenvectors,  even  though  the  segmentation  is  only  approximately 
90%  accurate,  which  shows  that  accurate  models  can  be  obtained  in  practice. 

5.2.2  Signal  Segmentation:  Earthquake  Seismogram 

This  experiment  was  performed  with  a real  world  data  set  consisting  of  a seismogram 
of  a magnitude  6.8  earthquake  that  occurred  near  Jalisco,  Mexico  on  May  1,  1997.  The 
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data  is  the  vertical  long-period  (f=\  second)  seismogram  recorded  by  the  “NEW”  station, 
and  is  shown  in  panel  (a)  of  Figure  5-5.  In  the  same  graph,  the  vertical  lines  represent 
human  expert  scoring1  of  phase  transitions  between  various  direct  and  reflected  primary, 
secondary,  and  surface  waves  of  the  earthquake. 

There  were  five  experts,  each  performing  a dimensionality  reduction  of  60:4.  The  gate 
was  a TDNN  with  two  hidden  layers  and  architecture  20-9-7-5.  The  gate  and  experts  were 
trained  for  1000  epochs  of  the  EM  algorithm.  For  each  M step,  the  gate  was  trained  for  10 
epochs.  The  five  gate  outputs  are  also  shown  in  Figure  5-5.  We  see  that  different  experts 
have  specialized  on  different  temporal  regions  of  the  time  series,  and  that  there  is  some 
correlation  between  the  expert  segmentation  and  the  gate  segmentation.  We  did  not,  how- 
ever, attempt  to  compare  the  gate  output  with  the  human  expert  identification  of  the  seg- 
ments. 

5.2,3  Noise  Reduction 

Consider  the  case  where  a time  series  x{n ) is  corrupted  by  additive  zero-mean  white 
noise.  Let  v(n)  be  the  additive  noise  and  let  the  vector  w be  some  arbitrary  FIR  filter.  The 
signal  to  noise  power  ratio  at  the  output  of  the  filter  is  given  by 
SNR  = E[wT Rw]/ (a1  E[wT w]) , where  a2  is  the  noise  power.  It  is  not  difficult  to  show 
the  well  known  result  that  maximizing  the  signal  to  noise  ratio  implies  that  w is  an  eigen- 
vector of  the  autocorrelation  matrix:  Maxw[SNR]  =>  Rw  = A,m>.  By  symmetry  argu- 
ments, the  SNR  will  also  be  maximized  in  the  reconstructed  signal.  One  can  exploit  this 
property  to  filter  out  additive  noise  from  time  series,  assuming  that  the  time  series  is  sta- 
tionary. If  it  is  not,  the  reconstructed  signal  can  show  dramatic  distortion.  We  can  achieve 


1.  Dr.  Doug  Smith  of  the  University  of  Florida  Geology  Dept,  provided  the  expert  scoring. 
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further  noise  reduction,  through  a technique  given  by  Danilov  et  al.  (1996),  by  noting  that, 
at  the  final  reconstructed  output  vector,  the  components  £,•(«)  and  xt  +j(n  + j)  both  repre- 
sent estimates  of  the  same  point  in  the  time  series,  x(n  — i).  Thus,  we  can  improve  the 
reconstruction  estimate  by  forming  the  average: 


We  first  tested  the  noise  reduction  properties  of  the  algorithm  on  a chirp  signal  with  -6 
dB  additive  white  noise.  The  chirp  illustrates  our  point  very  well  and  is  the  limiting  case  of 
a switching  process  so  our  methodology  can  still  be  applied.  There  were  4 experts,  each 
performing  a dimensionality  reduction  of  24  to  2.  The  gate  was  a TDNN  with  two  hidden 


EM  algorithm.  For  each  M step,  the  gate  was  trained  for  10  epochs.  A reasonable  segmen- 
tation was  achieved  after  only  25  EM  epochs,  followed  by  minor  improvements.  The 
results  after  training  are  shown  in  Figure  5-6.  Panel  (d)  shows  that  the  gate  has  developed 
an  intuitive  segmentation,  with  each  expert  specializing  on  a different  region  of  the  chirp 
signal.  There  is  some  minor  spurious  switching  around  transitions  between  the  regimes 
defined  by  the  gate.  However,  this  is  to  be  expected,  since  a chirp  signal  violates  the 
assumption  of  piecewise  stationarity.  The  frequency  spectra  of  the  first  principal  eigenvec- 
tor of  the  four  experts  are  shown  in  Figure  5-7.  The  graphs  clearly  show  that  each  expert  is 
specializing  on  a different  region  of  the  chirp  signal  in  the  frequency  domain.  We  can 
quantitatively  compare  the  competitive  and  global  PCA  techniques  by  calculating  the 
mean  square  error  between  the  original  signal  and  the  reconstructions.  The  results  are 


p- 1 


(5.1) 


layers  and  architecture  24-6-5-4.  The  gate  and  experts  were  trained  for  200  epochs  of  the 
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shown  in  Table  5-1.  Competitive  PCA  has  a smaller  reconstruction  than  the  global  PC  A 
reconstruction,  independent  of  the  dimensionality  reduction,  P. 


Table  5-1 . Mean  square  reconstruction  error  as  a function  of  the  number  of  principal 

components,  P. 


Competitive  PCA 
P-2  per  Expert 

Global  KLT 

P=  4 

P=  8 

P=12 

0.034 

0.060 

0.047 

0.081 

The  reconstructed  waveforms  are  plotted  in  Figure  5-8.  Competitive  PCA  clearly  pro- 
vides a better  reconstruction  than  the  global  PCA  reconstructions.  Furthermore,  the  global 
PCA  reconstructions  demonstrate  a trade-off  between  the  number  of  principal  components 
and  the  quality  of  the  reconstruction.  If  too  few  principal  components  are  used,  then  the 
reconstruction  exhibits  amplitude  modulation,  due  to  the  bandpass  nature  of  the  optimal 
eigenfilter.  If  too  many  principal  components  are  used,  then  the  reconstruction  matrix 
approaches  the  identity  matrix,  and  both  signal  and  noise  are  reconstructed.  Competitive 
PCA  is  able  to  deal  independently  with  the  noise  reduction  and  the  modeling. 

We  then  tested  the  noise  reduction  properties  of  the  algorithm  on  a digital  recording  (fs 
= 11127  Hz)  of  a violin  playing  four  notes:  C-A#-A-A#,  where  A corresponded  to  stan- 
dard concert  pitch  (440  Hz).  Observe  that  the  2nd  and  4th  notes  are  nominally  the  same 
pitch.  Since  the  violin  has  no  frets,  two  nominally  identical  notes  can  still  differ  slightly  in 
fundamental  frequency,  and  transitions  between  notes  may  occur  gradually.  To  the  digital 
recording,  we  added  white  Gaussian  noise  for  a signal  to  noise  ratio  of  0 dB,  which  then 
became  the  input  to  our  algorithm.  There  were  3 experts,  each  fed  by  a tap-delay  line  with 
100  taps.  The  gate  was  a TDNN  with  two  hidden  layers  and  architecture  100-6-5-3. 
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Because  of  the  size  of  the  data  set,  we  used  a slightly  different  training  procedure.  Each 
iteration  corresponded  to  5 epochs  for  training  the  experts  and  then  200  epochs  for  training 
the  gate.  The  initial  number  of  principal  components  was  P-2,  which  was  annealed  up  to 
P = 1 0,  by  adding  2 principal  components  at  each  iteration,  for  a total  of  5 iterations.  A 
reasonable  segmentation  was  achieved  after  only  2 iterations. 

Figure  5-9  shows  the  results  after  training.  Qualitatively,  we  see  in  panel  (c)  that  the 
reconstructed  time  series  has  removed  much  of  the  noise  while  retaining  the  envelope  of 
the  original  signal.  To  the  ear,  there  is  a vast  reduction  in  the  noise  level,  but  tonal  quality 
was  changed  such  that  the  signal  was  not  readily  identifiable  as  a violin  by  unbiased  listen- 
ers. By  subtracting  the  original  signal  from  the  reconstructed  version,  we  estimated  a sig- 
nal to  noise  ratio  in  the  reconstructed  signal  of  47  dB.  Panels  (d),  (e)  and  (f)  show  the  gate 
output  for  the  3 experts  and  shows  that  the  gate  successfully  segmented  the  signal.  Fur- 
thermore, the  gate  output  corresponding  to  the  2nd  expert  properly  indicates  that  the  2 nd 
and  4th  notes  are  the  same.  This  gives  us  confidence  that  the  experts’  eigenfilters  do  indeed 
represent  the  stationary  regimes  of  the  time  series.  Figure  5-10  compares  the  periodogram 
of  the  stationary  regions  of  the  original  signal,  as  determined  by  the  gate  segmentation, 
with  the  corresponding  experts’  eigenfilters  frequency  response.  Only  the  1st,  3rd,  5th  7th, 
and  9 eigenfilters  are  shown,  since  the  even  eigenfilters’  frequency  response  are  nearly 
identical  to  the  odd  ones.  Each  expert’s  1st  eigenfilter  is  centered  around  one  of  the  funda- 
mental frequencies  of  the  three  pitches  present  in  the  note  sequence.  This  is  in  spite  of  the 
fact  that  the  eigenfilters  have  a bandwidth,  D'\  of  approximately  110  Hz,  which  is  larger 


than  the  83  Hz  difference  between  the  A and  C notes. 
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Table  5-2.  Peak  frequencies  (Hz)  of  the  signal  periodogram. 


Regime 
corresponding 
to  note 

Harmonic 

Funda- 

mental 

1st 

2nd 

3rd 

4th 

C 

535.3 

1070.6 

1599.9 

2136.7 

2666.6 

A# 

469.1 

938.3 

1408.9 

1878.1 

2347.2 

A 

445.1 

887.2 

1333.7 

1777.3 

2222.4 

Table  5-3.  Peak  frequencies  (Hz)  of  the  expert  and  global  eigenfilters. 


Expert 

corresponding 
to  note 

Eigenfilter  Order 

1st 

3rd 

5 th 

7th 

9th 

C 

534.1 

1073.8 

2136.4 

1607.9 

467.3 

612.0 

A# 

467.3 

2331.1 

2815.1 

2342.2 

2809.6 

929.1 

1869.3 

A 

450.6 

1774.8 

1335.2 

3115.6 

3555.1 

Global 

489.6 

417.3 

567.5 

2331.1 

2820.7 

2342.2 

2809.6 

923.54 

1079.3 

Table  5-2  shows  the  peak  frequencies  of  the  original  time  series  as  measured  from  the 
global  periodogram.  Table  5-3  shows  the  peak  frequencies  of  the  expert  and  global  eigen- 
filters for  the  noisy  time  series.By  comparing  the  two  tables,  we  see  that  the  peak  frequen- 
cies of  the  1st  eigenfilters  of  the  experts  do  indeed  correspond  to  the  fundamental 
frequencies  of  the  three  notes.  The  double  entries  in  some  of  the  cells  indicate  that  some  of 
the  eigenfilters  exhibit  two  peaks  of  comparable  power.  For  example,  the  9th  eigenfilter  of 
the  note  C expert  has  two  peaks  at  467.3  and  612.0  Hz.  This  split  is  not  caused  by  interfer- 
ence from  the  other  fundamentals,  since  the  gate  segmentation  is  nearly  perfect.  Rather,  it 
is  caused  by  residual  energy  around  the  note  C fundamental  that  the  1st  eigenfilter  did  not 
match.  The  splitting  occurs  because  the  orthogonality  constraint  means  that  higher  order 
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eigenfilters  have  zeros  near  the  peak  locations  of  lower  order  eigenfilters.  The  global 
eigenfilters,  on  the  other  hand,  do  exhibit  interference  across  stationary  regions.  For  exam- 
ple, the  1st  global  eigenfilter  has  a peak  at  489.6  Hz,  which  is  close  to  the  average  of  the 
three  fundamental  peaks. 


Figure  5-2.  Switching  regime  simulation:  (a)  switching  FIR  signal;  (b)  expert  squared 
reconstruction  error;  (c)  expert  log  probability;  (d)  gate  output;  (e)  posterior  probability. 


69 


Figure  5-3.  Switching  regime  simulation.  Power  spectrums  (dB)  of  source  and  expert 
matched  eigenfilters:  (a)  source  I and  expert  I eigenfilters;  (b)  source  II  and  expert  II 

eigenfilters. 
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(a) 


(b) 


(c) 


Figure  5-4.  Switching  regime  simulation.  Convergence  properties  for  expert  I (solid)  and 
expert  II  (dashed):  (a)  gate  segmentation  quality  vs.  iteration  number;  (b)  posterior  proba- 
bilities segmentation  quality  vs.  iteration  number;  (c)  angle  (degrees)  between  expert’s  1st 
eigenvector  and  true  1 st  eigenvector  vs.  iteration  number. 
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Figure  5-5.  Seismogram  segmentation:  (a)  seismograph  and  human  expert  scoring; 
(b),(c),(d),(e),  and  (f)  gate  outputs  of  experts  1,2, 3, 4,  and  5,  respectively 
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Figure  5-6.  Chirp  simulation:  (a)  corrupted  signal;  (b)  expert  squared  reconstruction  error; 
(c)  expert  log  probability;  (d)  gate  output;  (e)  posterior  probability. 
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Figure  5-7.  Chirp  simulation:  (a,b,c,d)  frequency  response  of  1st  principal  eigenfilter  for 

experts  1,2,3,  and  4,  respectively. 
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Figure  5-8.  Chirp  simulation:  (a)  original  signal;  (b)  corrupted  signal;  (c)  competitive  PCA 
reconstruction;  (d,e,f)  global  KLT  reconstruction  P=4,8,12. 
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Figure  5-9.  Violin:  (a)  original  signal;  (b)  corrupted  signal;  (c)  competitive  PCA  recon- 
struction; (d),  (e),  (f)  gate  output  corresponding  to  the  experts  for  notes  C,  A#,  and  A, 

respectively. 
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Figure  5-10.  Violin:  (a),(c),(e)  original  signal  periodogram  (linear)  for  notes  C,  A#,  and  A, 
respectively,  vs.  frequency  in  Hz;  (b),(d),(f)  corresponding  expert  eigenfilter  frequency 

response  (linear)  vs.  frequency  (Hz). 
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5.3  Temporal  PC  A 

In  order  to  fully  understand  the  simulation  results  of  competitive  temporal  PCA,  we 
must  understand  temporal  PCA.  This  can  be  best  accomplished  by  understanding  the  fre- 
quency domain  behavior  of  the  eigenfilters.  The  discrete  time  Fourier  transform  (DTFT) 
of  the  eigenvalue  equation  results  in  an  integral  equation 

1/2 

Ru  = \uo  J (5.2) 

J sin  [n(J -J  )J 

-1/2 

where  S(f)  is  the  power  spectrum  corresponding  to  the  autocorrelation  function,  r{ri), 

1/2 

r(n)  = \ S(f)ei2nn/df.  (5.3) 

-1/2 

The  eigenfunction  \j /(/)  is  the  DTFT  of  the  eigenvector  u 

N- 1 

, v"1  / \ — (TV — 1 )/2]  /c  >i\ 

\\i(f)  = e 2]  u{n)e  • (5.4) 

n = 0 

Since  u is  either  symmetric  or  anti-symmetric,  the  phase  term  (N- 1 )/2  and  the  constant  e 
are  introduced  to  ensure  that  the  eigenfunction  is  real.  The  constant  s is  defined  to  be  1 or 
i,  depending  on  whether  u is  symmetric  or  anti-symmetric,  respectively. 

Equation  (5.2)  in  the  frequency  domain  is  a homogenous  Fredholm  equation  of  the 
second  kind  with  a Dirichlet  kernel  and  weighting  function  S(f).  When  S(J)  is  a bandlim- 
ited  low  pass  filter,  such  that  S(/)=l  for  \f\<W,  and  0 elsewhere,  the  resulting  equation 
defines  the  discrete  prolate  spheroidal  wave  functions  (DPSWF)  of  Slepian  (1978) 
w 

} N,  W)df  = A t(N,  W)Uk(f,  N,W)  k — 0...N—  1 . (5.5) 

-w 
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Properties  of  the  DPSWF,  Uk(fJV,W),  and  their  eigenvalues,  A k{N,W),  are  given  in  the 
appendix.  We  will  now  propose  and  demonstrate  a close  relationship  between  the  fre- 
quency domain  solutions  of  (5.2)  and  the  DPSWF  for  the  medium  window  case.  The  large 
and  small  window  solutions  of  (5.2)  have  been  studied  from  various  perspectives,  which 
we  briefly  summarize  below. 

5.3.1  The  Large  Window  Solution 

As  N — » oo , the  Dirichlet  function  becomes  a Dirac  delta  function  and  (5.2)  becomes 

M '(J)  = V(/)S(/)  (5.6) 

the  solution  of  which  is 

v|/*(/)  = 8 (/--/*),  **  = %).  (5.7) 

That  is,  the  eigenfilters  form  a fixed  Fourier  basis,  and  the  eigenvalues  become  the  power 
spectrum.  Since  the  autocorrelation  matrix  is  real  and  symmetric,  the  eigenvalues  are  real, 
and  we  can  enforce  the  constraint  that  the  eigenfilters  be  real.  Then,  each  eigenvalue  will 
be  doubly  degenerate,  associated  with  two  real  sinusoidal  eigenvectors  separated  in  phase 
by  90  degrees.  Assuming  that  the  eigenvalues  have  been  ordered  from  largest  to  smallest, 
so  that  A-o  is  the  largest  eigenvalue,  then  A.0  is  the  maximum  value  of  the  power  spectrum: 
X0  = maxy{5(/)] . 

5.3.2  The  Small  Window  Solution 

Casdagli  et  al.  (1991)  showed  that  when  the  autocorrelation  matrix  is  viewed  as  a sam- 
pled version  of  the  continuous  time  autocorrelation  function,  the  time  domain  eigenvectors 
can  be  approximated  by  the  discrete  Legendre  polynomials  when  the  window  width,  N xs, 
is  small,  where  r5  is  the  sampling  period.  For  the  purely  discrete  case,  as  A becomes  small, 
the  time-bandwidth  product  law  dictates  that  the  bandwidth  of  the  eigenfilters  will  become 


79 


large.  The  eigenfilters  then  form  a fixed  basis  in  the  sense  that  they  become  increasingly 
insensitive  to  the  specific  spectral  properties  of  the  signal.  It  is  interesting  to  note  that  in 
the  limit  as  the  bandwidth  approaches  the  extremes  W—>0  and  W—>Z2,  the  DPSWF, 
become  the  Legendre  polynomials. 


5.4  Motivation:  Matched  Filter  Approach 

Thomson  (1982)  first  suggested  that  the  DPSWF  might  be  a suitable  basis  in  which  to 
expand  the  eigenfunctions  in  the  frequency  domain.  We  now  consider  an  energy  maximi- 
zation approach  to  solving  the  eigenvalue  equation  in  the  frequency  domain  which  pre- 
sents a strong  argument  for  the  use  of  the  DPSWF.  It  is  well  known  that  maximizing  or 
minimizing 


1/2 

E = uRu  = j W(f)\2S(f)df  (5.8) 

-1/2 

subject  to  the  constraint 


1/2 

UTU  = j |vK/)lV=  1 (5.9) 

-1/2 

leads  to  the  eigenvalue  equation  (5.2).  Equation  (5.8)  in  the  frequency  domain  shows  that 
the  eigenfunctions  act  as  matched  filters.  The  frequency  domain  solution  to  maximizing  E 
subject  to  (5.9)  is  the  same  as  the  infinitely  large  window  solution;  namely,  a Dirac  delta 
function,  v|/(/)  = 8(f— /max),  centered  around  the  peak  value  of  the  signal  spectrum, 
/max  = maxy{5(/)] , so  that  E simply  picks  out  the  largest  value  of  S(J).  However,  since  u 
is  index  limited  in  the  time  domain,  it  cannot  have  infinite  concentration  in  the  frequency 
domain.  This  means  that  the  constraint  that  u is  index  limited  is  not  embodied  in  the  fre- 
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quency  domain  eigenvalue  equation.  The  Dirichlet  kernel  arises  because  of  the  truncation 
of  the  infinite  sequence  r(n)  but  the  filter  u is  not  the  truncation  of  a longer  sequence.  Still, 
this  suggests  that  the  solution  must  be  related  to  the  index  limited  sequence  that  has  the 
largest  concentration  of  energy  in  the  frequency  domain.  This  is,  in  fact,  a defining  prop- 
erty of  the  DPSWF  with  the  largest  eigenvalue.  Therefore,  we  propose  that  the  solution  to 
the  maximization  of  (5.8)  is,  to  a high  degree  of  approximation,  a frequency  shifted 
DPSWF 

1/2 

Em„  * maxA  w,  j | W~4  N>  *0  fs(JW-  (5-">) 

-1/2 

Denote  the  optimal  values  by  /0  and  fV0.  Since  the  DPSWF’s  are  real,  this  implies 
that  the  solution  to  (5.2)  corresponding  to  the  largest  eigenvalue  is  approximately 

Vo(/>*  U0(f-f0,N,W0).  (5.11) 

Since  both  v| ;0(/)  and  U0(f)  have  all  their  zeros  on  the  unit  circle  (Makhoul,  1981),  the 
frequency  shifting  must  be  done  so  as  to  preserve  this  property.  Also,  note  that  since  U0  is 
symmetric,  if  the  spectrum  is  also  symmetric  about  its  local  maximums,  then  /0  will  coin- 
cide with  one  of  these  local  maximums. 

We  now  demonstrate  some  of  these  properties  for  a spectrum  from  the  switching  FIR 
simulation  of  the  first  part  of  this  chapter.  Figure  5-11  (a)  shows  the  original  spectrum. 
Panel  (b)  shows  the  spectrum  of  the  first  eigenfilter  and  panel  (c)  shows  the  spectrum  of 
the  optimally  shifted  DPSWF.  We  see  that  there  is  a remarkable  similarity  in  the  two  spec- 
tra. This  is  also  confirmed  in  the  time  domain,  shown  in  Figure  5-12.  Again,  there  is  a 
remarkable  similarity  between  the  1st  eigenvector  and  the  optimally  frequency  shifted 


DPSS. 
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Figure  5-11.  Power  spectrums  vs.  frequency  (digital):  (a)  original  signal;  (b)  1st  eigenfil- 

ter;  (c)  optimal  frequency  shifted  DPSWF. 


Figure  5-12.  Signal  1st  eigenfilter  (solid)  and  optimal  frequency  shifted  DPSWF  (hollow) 

vs.  time  (samples). 
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Figure  5-13.  (a)  Matched  energy  vs.  IF for  optimal  frequency  offset  /o ; (b)  Matched  energy 


vs.  f0  for  optimal  half-bandwidth  W. 

Figure  5-13  shows  the  matched  energy  as  a function  of  the  frequency  offset  (a)  and 
half-bandwidth  parameter  (b)  of  the  DPSWF.  Note  that  the  optimal  DPSWF  is  relatively 
insensitive  to  the  half-bandwidth  parameter,  W,  and  peaks  at  a value  close  to  the  optimal 
time-bandwidth  product,  2 WN  » 2 • 0.02  • 20  = 0.8 . This  implies  that  the  value  of  W can 
be  chosen  based  on  the  dimensionality,  N,  of  the  tapped  delay  line,  irrespective  of  the  par- 
ticular signal  spectrum.  Note  also  that  the  optimal  matched  energy  is  very  sensitive  to  the 
frequency  offset,  a desireable  property. 

5.5  Separable  Kernel  Approach 

The  previous  development  presented  a strong  motivation  for  considering  the  DPSWF 
as  a suitable  basis  for  the  eigenfunctions.  We  now  proceed  to  find  such  a basis.  A closed 
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form  algebraic  solution  of  a homogenous  Fredholm  equation  of  the  second  kind  can  be 
found  whenever  the  kernel  is  separable  or  degenerate,  meaning  expressible  as  a sum  of 
products  of  the  form 

N-l 

sm[Nn(f-f  )]  _ y q (5  j2\ 

sin [71  (/—/')]  ^ UnViUnV  )•  1 ] 

n = 0 

Substitution  back  into  the  frequency  domain  eigenvalue  equation  leads  to 

N- 1 1/2  N- 1 

X Q„(f)  I Q„(f')S(f')W)df  = X c„Qn(J)  = XV(/)  (5.13) 

fl  = 0 -1/2  n = 0 

where  the  integral  has  been  reduced  to  the  constant  c„.  Note  that  (5.13)  represents  an 
expansion  of  the  eigenfunctions  in  terms  of  the  basis  Qn(f) 

N-  1 

vw  = ^ £ c„e„w  • (5.i4) 

n = 0 

We  can  solve  for  the  expansion  coefficients  by  multiplying  both  sides  of  (5.13)  by 
and  integrating: 

yv-i  1/2  n-\ 

S c»  I £ am„c„  = Xcm.  (5.15) 

« = 0 -1/2  n = 0 

Regarding  the  expansion  coefficients,  {cn},  of  the  eigenfunction  as  a vector,  they  are 
themselves  the  solution  of  an  eigenvalue  equation  of  a matrix  A 


Ac  = KC 


(5.16) 


whose  elements  are  given  by 
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1/2 

«™  = 1 Qmm„v)sv)df.  (5.i7) 

-1/2 

There  will  be  N solutions  to  (5.16),  each  producing  a different  solution  to  (5.2).  There 
are  two  known  ways  to  separate  the  Dirichlet  kernel.  One  is  through  the  use  of  a Fourier 
basis 


sin[TV7i  (/-/')] 
sin  [ ft  (/—/')] 


N-  1 


I 


ein(N-  1 -2n)fe~in(N-  1 -2n)f' 


n = 0 


(5.18) 


Use  of  (5. 1 8)  in  conjunction  with  the  previous  development  leads  to  A =>  R and  thus  back 
to  the  time  domain  eigenvalue  equation  (5.2)  and  thus  yields  no  new  insight.  However,  the 
Dirichlet  kernel  is  also  separable  in  terms  of  the  DPSWF 


N-  1 

t^j1  = z Un<f-fo’N’  ’nu.v-f, IT)-  (5.19) 

n = 0 

where yj)  is  an  arbitrary  frequency  offset.  Use  of  (5.19)  in  conjunction  with  the  previous 
development  leads  naturally  to  an  expansion  of  the  kth  eigenfunction  in  terms  of  the 
DPSWF, 


N—  1 

= j-  Z *0  * = 0 -N-  1 (5.20) 

n = 0 

• th  th 

where  {c^)n  is  the  n component  of  the  k vector  solution  of  the  eigenvalue  equation, 
Ack  = ^kck , with  A having  elements 

1/2 

«„„=  \ Um(f-UN,W)U„(f-fa,N,W)S(f)df.  (5.21) 


-1/2 
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Thus,  we  see  that  an  expansion  of  the  eigenfunctions  of  (5.2)  in  terms  of  the  DPSWF 
of  (5.5)  falls  out  naturally  as  a result  of  the  separability  of  the  Dirichlet  kernel.  We  now 
turn  to  the  question  of  choosing  the  frequency  offset.  Based  on  the  matched  filter 
approach,  it  seems  natural  to  choose to  coincide  with  the  local  peak  of  the  signal  spec- 
trum with  the  largest  energy,  and  W to  be  the  effective  bandwidth  of  the  peak.  This  will 
ensure  that  energy  is  concentrated  along  the  main  diagonal  of  A,  and  that  the  principal 
eigenfunction  will  be  expressed  primarily  in  terms  of  the  principal  DPSWF, 
v|/0(/)  « U0(f—fo,  N,  IV),  and  that  the  other  expansion  coefficients  will  be  nearly  zero. 

5.6  Expansion  in  Terms  of  Multiple  Frequency  Shifted  DPSWF 
We  now  present  a generalization  of  Thomson’s  proposed  DPSWF  basis.  We  start  with 
(5.13),  an  expansion  of  the  eigenfunctions  in  terms  of  a basis 

N—  1 

V(/)  = X cnQnU)  (5.22) 

n = 0 

where  the  basis  is  a set  of  frequency  shifted  DPSWF.  We  divide  the  digital  frequencies  into 
K frequency  bands,  each  with  a center  frequency  of fk.  Each  band  may  have  a different 
bandwidth,  specified  by  the  half-bandwidth  parameter  Wk.  For  each  center  frequency,  we 
use  only  the  first  2 NWk  frequency  shifted  DPSWF,  since  only  those  have  eigenvalues  close 
to  one.  The  basis  is  then  given  by: 

&,</)  = »*)  * I o";f2wt-i)  (5'23) 

The  index  m is  the  cumulative  count  (minus  one)  of  the  number  of  DPSWF  employed,  of 


which  the  total  number  is 
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K- 1 K- 1 

£ 2 NWk  = N'£j2Wk  = N . (5.24) 

/t  = 0 k = 0 

as  required  in  order  that  the  number  of  equations  equal  the  number  of  eigenfunctions.  If 
the  quantity  2NWk  is  not  an  integer,  it  can  be  rounded  up  or  down  as  required,  as  long  as 
the  total  number  of  basis  functions  is  N.  This  division  of  the  digital  frequency  axis  is 
shown  in  Figure  5-14.  (a).  Note  that  since  the  power  spectrum  is  an  even  function,  we 
must  have  fk  = f^.\.k  and  Wk  = WK.x_k.  Thomson’s  proposed  basis  is  a subset  of  this  where 
all  the  bandwidths  are  equal,  and  is  shown  in  Figure  5-14.  (b). 


Figure  5-14.  Dividing  the  digital  frequency  axis:  (a)  arbitrary  bandwidth,  (b)  Thomson’s 

uniform  bandwidth. 


It  is  important  to  note  that  this  basis,  while  orthogonal  within  bands,  is  not  orthogonal 
between  bands.  However,  as  long  as  the  basis  functions  are  linearly  independent,  they  will 
be  complete  in  the  space  of  Fourier  transforms  of  index  limited  sequences.  The  linear 
independence  of  the  basis  functions  can  theoretically  be  tested  by  verifying  that  the  Wron- 


skian  is  non-zero 
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Q,if)  • 

• Qn-  i c/) 

det 

Q'oif)  • 

• Q'n-xU) 

S 

• i 

O) 

(5.25) 


Unfortunately,  since  there  is  no  analytical  expression  for  the  DPSWF,  it  is  impossible  to 
evaluate  the  Wronskian  for  the  general  case. 

We  now  proceed  to  solve  for  the  expansion  coefficients.  Substituting  (5.22)  into  the 
frequency  domain  eigenvalue  equation  (5.2)  there  results 


N- 1 1/2 


N-  1 


I cn  J = X X (5.26) 

71  = 0 —1/2  * = o 

Multiplying  both  sides  by  Qn'(f)  and  integrating  over  the  principal  domain  of  the  shifted 
DPSWF,  the  left  hand  side  of  (5.25)  becomes 


N-  1 


1/2 


u.  J 


sin  [Me  (/—/')] 
sin  [ti  (/-/')] 


QnWf 


ttz  = 0 


-1/2 


fv-Wv 


S(f')Qn(f')df 


(5.27) 


The  integral  in  brackets  can  be  reduced  by  using  the  defining  equation  for  the  DPSWF 
(5.5),  in  which  case  (5.26)  becomes 

N- 1 1/2  N- 1 /*•+»*• 

<V  Z cm  j Qm.(f')Qm(f)S<fW  = 1^'.  j (5.28) 

771  = 0 -1  /2  777  = 0 

where  it  is  to  be  understood  that  the  eigenvalue  associated  with  the  mth  basis  function  is  a 
DPSWF  eigenvalue  that  depends  on  both  the  particular  frequency  band  index,  k,  and  the 
DPSWF  index  / within  the  band:  am  = A j(N,  Wk) . We  can  write  (5.28)  as 
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N— 1 N-\ 

o , V a , c = X V b , c (5.29) 

w m m m / . w m m 

m = 0 m = 0 

or  in  matrix  form,  as  a generalized  eigenvalue  equation 

Z,4c  = IBc  (5.30) 

where  c is  the  vector  of  expansion  coefficients  of  the  eigenfunction,  I is  a diagonal  matrix 
containing  the  DPSWF  eigenvalues,  and  A and  B are  matrices  with  elements,  respectively, 

1/2 

v.  = J Q*mmv)svw  <5.31) 

-1/2 

/*■  + wk. 

*„■„=  1 Qm.(f)Qmm ■ (5.32) 

fr- Vi- 
la general,  there  will  be  N solutions  to  (5.30),  each  producing  a unique  solution  to  (5.2). 

Our  goal  now  is  to  specify  a basis  for  which  the  eigenfunctions  are  expressible  in 
terms  of  the  minimum  number  of  DPSWF.  That  is,  we  desire  that  each  vector  of  expansion 
coefficients,  cn,  be  dominated  by  a single  coefficient.  We  saw  from  the  matched  filter 
approach  that  the  eigenfunction  with  the  largest  eigenvalue  was  approximately  expressible 
in  terms  of  a single  DPSWF  when  the  DPSWF  was  frequency  shifted  to  coincide  with  the 
local  peak  of  the  signal  spectrum  with  the  largest  energy.  Based  on  that  result,  we  propose 
a basis  of  frequency  shifted  DPSWF  where  the  K center  frequencies  and  bandwidths  are 
chosen  to  align  with  the  local  peaks  of  the  signal  spectrum.  This  should  concentrate  most 
of  the  energy  of  the  expansion  in  as  few  coefficients  as  possible,  and  result  in  the  closest 
possible  relationship  between  the  eigenfunctions  and  the  DPSWF. 
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5.1  Conclusion 

We’ve  seen  how  the  Competitive  PC  A algorithm  from  the  previous  chapter  can  be 
adapted  to  time  series  by  simply  adding  a tapped  delay  line  at  the  input.  The  algorithm  was 
then  able  to  segment  time  series,  but  not  as  well  as  the  classical  GLR  algorithm  perform- 
ing prediction.  Therefore,  we  conclude  that  its  true  strength  lies  in  its  potential  for  noise 
reduction  and  optimal  encoding. 

In  order  to  understand  the  simulation  results,  we  then  explored  the  properties  of  an 
eigenfilter  in  isolation.  To  this  end,  we  examined  the  eigenvalue  equation  in  both  the  time 
and  frequency  domain.  We  showed  that  solving  the  eigenvalue  equation  in  the  time 
domain  is  equivalent  to  assuming  a Fourier  basis  in  the  frequency  domain.  We  also 
showed  that  there  is  another  natural  basis  for  solving  the  equations;  namely,  the  discrete 
prolate  spheroidal  wave  functions  (DPSWF).  We  then  proposed  conditions  under  which 
there  is  a close  correspondence  between  the  eigenfilters  and  the  DPSWF. 


CHAPTER  6 

MEMORY  IN  GATED  COMPETITIVE  SYSTEMS 
6.1  Introduction 

We’ve  seen  how  the  output  of  the  gate  effectively  segments  a signal.  Unfortunately, 
due  to  several  factors,  the  segmentation  can  exhibit  spurious  switching.  Such  switching 
may  be  due  to  genuine  overlap  in  the  return  maps  of  the  various  dynamical  regimes  or,  in 
the  language  of  non-linear  dynamics,  the  manifolds  of  the  various  dynamical  regimes  may 
intersect  in  phase  space.  Additionally,  measurement  noise  can  “smear”  manifolds  that 
ordinarily  would  never  intersect. 

One  way  to  reduce  such  spurious  switching  is  to  add  memory  to  our  gated  competitive 
system.  Through  the  use  of  memory,  an  expert  that  has  performed  well  in  the  recent  past 
can  be  given  an  advantage  in  the  present.  In  general,  memory  can  be  added  either  to  the 
competitive  framework  (gate),  the  competitive  mechanism  (the  experts’  performance  mea- 
sure), or  the  experts  themselves.  We  have  already  seen  one  architecture  that  adds  memory 
to  the  gate;  namely,  the  Annealed  Competition  of  Experts  (ACE).  In  this  chapter  we  will 
consider  other  ways  to  add  memory  to  a gated  competitive  system. 

6.2  Adding  Memory  to  a Cost  Function 

We  now  present  a recursive  mean  square  error  cost  function,  and  show  that  it  leads  to 
momentum  learning.  The  standard  mean  square  error  criteria  for  a single  adaptive  system 
is  a summation  of  the  squared  errors  over  the  entire  data  set: 
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N 

C=  £ e2(n ) (6.1) 

n = 1 

where  the  instantaneous  error  e(n ) is  the  difference  between  the  system  output  and  the 
desired  output.  Now  consider  a cost  function  that  uses  a recursive  estimate  of  the  local 
mean  square  error 


N 

c=  X 8(")  (6-2) 

n = 1 

where  s («)  is  the  local  mean  square  error  calculated  using  a recursive  filter 


e(«)  = Xe2(n)  + (1  — X)e(n—  1)  0 < A,  < 1 . (6.3) 

The  parameter  A.  controls  the  memory  depth  of  the  filter.  There  are  several  ways  to 
define  the  effective  memory  of  a recursive  system,  but  the  simplest  is  to  calculate  the  aver- 
age of  the  temporal  index  weighted  by  the  impulse  response  (Principe  et  al.,  1993). 
According  to  this  definition,  it  is  easy  to  show  that  the  effective  memory  depth  of  (6.3),  in 
samples,  is  given  by  A." 1 . 

Taking  the  expected  value  of  both  sides  of  (6.3)  and  simplifying  shows  that  it  is  an 

2 

unbiased  estimate  of  the  mean  square  error:  £[e]  = E[e  ] . Now  consider  the  gradient  of 
the  cost  function  with  respect  to  an  adaptable  weight 

N 


dC_ 

dw 


(6.4) 


n = 1 

where  the  instantaneous  gradient  is  given  by 


As(B)  = 2X^(n)  + (l-X)^(n-l). 


(6.5) 
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For  an  on-line  adaptive  system,  the  weight  change  is  proportional  to  the  instantaneous 
gradient.  Viewed  in  this  way,  (6.5)  is  equivalent  to  learning  with  momentum.  In  the  neural 
network  literature,  momentum  learning  is  presented  as  an  ad-hoc  way  of  speeding  up 
learning.  Here,  we  see  it  is  the  natural  consequence  of  a recursive  mean  square  error  cost 
function,  which  we  will  now  employ  in  the  following  models. 


6.3  Neighborhood  Map  of  Competing  Predictors 
Our  implementation  is  shown  in  Figure  6-1 . A set  of  predictors  works  operate  in  paral- 
lel on  the  input  time  series.  The  winning  predictor  is  the  one  with  the  smallest  local  mean 
squared  error 

winner(t)  = argmin([s;(r)]  (6.6) 

where  the  local  mean  square  error  is  computed  using  the  recursive  estimate  of  (6.3)  for 
each  expert 

£,(«)  = Xe2i{n)  + (\-'k)zi{n-\)  0 < X < 1 (6.7) 

where  et  is  the  instantaneous  squared  error  of  the  ith  predictor.  The  memory  term,  A.,  is 
identical  for  all  experts,  but  can  be  manually  adjusted  during  training.  The  gating  function, 
which  moderates  the  learning  rate  of  the  predictors,  is  determined  by  the  distance  from  the 
winning  predictor  to  the  other  predictors 


giU)  = 


f <A0) 

^ 2o\ty 
e 


(6.8) 


where  i is  the  predictor  to  be  updated,  j is  the  winning  predictor,  dn  is  the  neighborhood 
distance  from  predictor  i to  predictor  j,  and  a is  an  annealing  parameter  which  controls  the 
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Figure  6-1.  Neighborhood  map  of  competing  predictors. 


neighborhood  width.  This  results  in  the  model  shown  in  Figure  6-1 . The  signal  flow  graph 
is  shown  for  only  one  predictor. 

In  exact  analogy  with  training  a Kohonen  map,  both  the  neighborhood  width  and  the 
global  learning  rate  are  annealed  during  training  according  to  an  exponentially  decreasing 
schedule 

-t  -t 

o{t)  = a0ex  p(0  = PoeT  (6-9) 

where  x is  the  annealing  rate.  The  overall  learning  rate  for  the  ith  predictor  is  thus  given  by 

*1,(0  = H(0'g/(0-  (6-10) 

6.3.1  Simulation  I:  Switching  FIR  Process 

In  order  to  better  understand  the  algorithm,  we  used  linear  predictors  trained  with  the 
normalized  LMS  algorithm  for  the  following  simulation.  For  viewing  convenience,  we 
used  a one  dimensional  continuous  map,  where  the  last  node  is  mapped  as  a neighbor  to 
the  first  node.  Training  was  conducted  for  50  passes  through  the  entire  time  series,  using 
an  annealing  rate  of  x = 5 passes.  The  time  series  was  a switching  FIR  process.  It  was  gen- 
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Figure  6-2.  Switching  FIR  process:  (a)  time  series;  (b)  winning  predictors  after  training, 
erated  by  a 6th  order  normalized  FIR  filter,  driven  by  zero  mean  Gaussian  noise  of  unit 
variance,  whose  coefficients  were  random  and  change  every  200  samples.  The  time  series, 
which  consists  of  a total  of  8 stationary  regions,  is  shown  in  Figure  6-2. 

For  the  predictors,  we  used  eight  8th  order  predictors.  The  memory  depth  of  the  error 
was  25  samples.  The  initial  neighborhood  and  learning  rate  were  16  and  1,  respectively. 
The  winning  predictors  after  training  are  shown  in  Figure  6-2  (b).  The  time  series  was 
fairly  successfully  segmented,  except  for  a longer  than  expected  winning  run  for  predictor 
8,  and  minor  switching  errors  near  samples  200  and  475.  Those  latter  errors  correlate  well 
with  outliers  in  the  time  series,  which  temporarily  degrades  the  performance  of  the  win- 
ning predictor.  However,  in  both  cases,  recovery  occurred  rapidly. 
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From  the  FIR  coefficients  that  produced  the  series,  we  can  define  a similarity  coeffi- 
cient, which  is  tabulated  in  Table  6-1. 

|cos0j  = \Wr  Wj\.  (6.11) 


Table  6-1 . Distance  between  the  eight  stationary  regions. 


Reg- 

ion 

I 

II 

III 

IV 

V 

VI 

VII 

VIII 

I 

1 

II 

.38 

1 

III 

.36 

.61 

1 

IV 

.86 

.51 

.40 

1 

V 

.75 

.23 

.17 

.79 

1 

VI 

.07 

.38 

.03 

.40 

.59 

1 

VII 

.25 

.46 

.63 

.25 

.38 

.42 

1 

VIII 

.65 

.59 

.69 

.43 

.39 

.01 

.57 

1 

Region  I refers  to  samples  1-200,  region  II  refers  to  samples  201-400,  and  so  on.  Using 
this  metric,  the  two  regions  with  the  highest  similarity  (0.86)  were  sample  regions  1-200 
and  601-800.  From  the  neighborhood  map,  we  indeed  find  that  neighboring  predictors  6 
and  7 won  those  regions.  Likewise,  the  two  regions  with  the  second  highest  similarity 
(0.79)  were  sample  regions  601-800  and  801-1000,  won  by  neighboring  predictors  7 and 
8.  Alternatively,  we  find  that  the  two  regions  the  most  dissimilar  (0.01)  were  sample 
regions  1001-1200  and  1401-1600.  Indeed,  the  winners  of  those  two  regions  are  predictors 
1 and  5,  as  far  apart  as  the  map  will  allow.  Overall,  there  is  a very  high  correlation  between 
the  similarity  of  two  regions  and  their  distance  on  the  neighborhood  map. 
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Figure  6-3.  Chirp  example:  (a)  time  series;  (b)  winning  predictors  after  training. 


6.3.2  Simulation  II:  A Non-stationarv  Time  Series 

For  this  simulation  the  time  series  consisted  of  a chirp  signal  of  500  points,  whose 
angular  frequency  varied  from  0 to  1/8  of  the  Nyquist  frequency.  The  signal  is  shown  in 
Figure  6-3  (a).  For  the  predictors,  we  used  ten  6th  order  predictors.  The  memory  depth  of 
the  error  was  2.5  samples.  The  initial  neighborhood  and  learning  rate  were  10  and  1, 
respectively.  The  winning  predictors  after  training  are  shown  in  Figure  6-3  (b). 

We  see  that  the  non-stationary  time  series  was  segmented  in  an  intuitive  way.  Except 
for  the  first  winning  predictor,  the  number  of  samples  per  segmented  region  decreases 
with  time.  Also,  similar  frequency  ranges  in  the  time  series  are  mapped  as  neighbors  in  the 


predictor  map. 
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Figure  6-4.  Speech  example:  (a)  time  series  and  phoneme  segmentation;  (b)  winning  pre- 
dictors. 

6.3.3  Simulation  III:  Phoneme  Segmentation  of  Speech 

For  this  simulation  the  time  series  consisted  of  the  word  “alone”  spoken  by  a male 
speaker  and  sampled  at  1 6 kHz.  The  signal,  which  was  extracted  from  the  Timit  database, 
is  shown  along  with  its  suggested  phoneme  segmentation  (a/l/ow/n)  in  Figure  6-4. 

For  the  predictors,  we  used  seven  12th  order  predictors,  in  order  to  allow  for  transi- 
tions between  the  four  phonemes.  The  memory  depth  of  the  error  was  200  samples.  The 
initial  neighborhood  and  learning  rate  were  128  and  1,  respectively.  The  winning  predic- 
tors after  training  are  shown  in  Figure  6-4  (b). 

The  segmentation  corresponds  fairly  closely  to  the  Timit  suggested  phoneme  segmen- 
tation, allowing  for  a transition  region  between  the  third  and  fourth  phonemes.  However, 
although  properly  segmented,  the  same  predictor  won  both  the  /a/  and  /ow/  phonemes. 
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This  simulation  also  illustrates  how  predictors  can  drop  out  of  the  competition,  as  predic- 
tors one  and  seven  win  less  than  1%  of  the  time,  so  that  only  five  of  the  original  seven  pre- 
dictors are  finally  active. 

6.4  Self-annealing  Competitive  Prediction 

Recall  that  the  ACE  algorithm  uses  an  output  based  gate  with  memory.  The  memory  is 
in  the  form  of  a boxcar  moving  average  (MA)  of  the  local  squared  error  of  the  experts,  to 
which  a Gaussian  function  is  applied.  The  variance  of  the  Gaussian  function  determines 
the  degree  of  competition.  In  the  ACE  algorithm,  memory  is  used  in  the  gating  function 
only.  Here  we  propose  a cost  function  that  imbeds  memory  in  both  the  gate  and  the 
experts’  mean  square  error 

K 

c(n)  = X Si(n)^i(n)  (6-12) 

i=  1 

where  s;-  is  the  local  recursive  mean  square  error  of  the  ith  predictor,  repeated  here  for  con- 
venience 

£,(«)  = XeJ(«)  + (l -X)ei(n -1)  0 < X < 1 . (6.13) 

Again,  e;(«)  is  the  instantaneous  error  of  the  ith  expert.  All  the  experts  use  the  same  value 
for  the  memory  parameter,  X,  but  the  algorithm  could  be  modified  to  allow  for  differing 
memories. 

6.4.1  Coupling  the  Competition  with  the  Memory  Depth 

Turning  to  the  gate,  in  the  ACE  algorithm,  the  degree  of  competition  within  the  gate, 
as  expressed  through  the  annealing  parameter,  and  the  memory  depth  of  the  squared  error 
are  uncoupled.  And  yet,  it  is  intuitive  that  the  longer  the  time  period  over  which  we  collect 
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information  about  the  performance  of  the  experts,  the  better  should  be  our  judgement 
about  which  expert  is  valid.  Such  a coupling  eliminates  the  need  for  separate  annealing  of 
the  memory  depth  and  the  competition.  Niedzwiecki  (1992)  has  in  fact  formulated  such  a 
relationship.  Assuming  that  the  variances  of  the  errors  of  the  experts  are  unknown  but  dis- 
tributed according  to  a so-called  non-informative  prior  distribution,  Niedzwiecki ’s  formula 
for  the  probability  of  the  ith  expert  is  given  by 

Cp.-(rt) 

£■(»)  = -T <6'14) 

Z 

j=  1 

where  the  unnormlized  gate  output  is  given  by 

-M 

9,-(»)  = !><(«)]  2 (6-15) 

and  Sj  is  the  causal  boxcar  MA  of  the  squared  error 

M-  1 

e,-(»)  = jj  Z (616) 

m = 0 

The  unnormalized  gate  given  by  equation  (6. 1 5)  embodies  the  coupling  between  mem- 
ory depth  and  degree  of  competition.  When  M- 1,  the  unnormalized  gate  is  just  the  abso- 
lute value  of  the  instantaneous  error  cp,(«)  = |e/(«)| . As  M— » oo,  the  gate  implements 
hard  competition  such  that  gl{n)=  1 when  i represents  the  expert  with  the  smallest  mean 
square  error,  and  g;(«)=0  for  all  others. 

Niedzwiecki  developed  his  formulas  for  evaluating  the  performance  of  known  experts. 
Here,  we  borrow  Niedzwiecki’s  results  and  incorporate  it  into  our  cost  function  (6.12)  for 
adaptable  experts.  We  use  (6.14)  as  the  gating  function,  except  that  we  replace  the  moving 
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average  calculation  of  the  mean  square  average  in  (6.16)  by  our  recursive  calculation  in 
(6.13).  Likewise,  we  replace  the  memory  depth,  M,  in  (6.15)  by  the  memory  parameter, 


<P,(«)  = [e,(«)]2X-  (6.17) 

6.4.2  Gradient  Descent 

We  then  do  gradient  descent  on  both  the  calculated  gate  and  the  mean  square  errors  of 
(6.12).  Taking  the  partial  derivative  with  respect  to  any  free  parameter,  wk,  in  the  kth  sys- 
tem, results  in: 

^ + (6.18) 


Avv* = ~^E{n)  = 


zk{n)  = ek(n)—yk(n)  + (\-\)zk(n-\) 


(6.19) 


and  with  respect  to  the  X parameter 


K 


AX 


7X  X gk 


2X 


k=  l 


E ^ , [£-e*]loge*(«)- 

L-  + 2J.-ljvtW _ 


(6.20) 


v*(0  = 


— S/tC"  — 1 ) 
x 


(6.21) 


Equation  (6.19)  for  zk  is  the  standard  weight  update  equation  with  momentum  for  the 
k expert  operating  independently.  Thus,  momentum  learning,  which  has  been  previously 
presented  as  an  ad-hoc  way  to  speed  up  and  stabilize  neural  network  training,  falls  out  nat- 
urally as  a result  of  using  the  mean  square  error  in  the  cost  function  (6.12)  instead  of  the 


instantaneous  error. 
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The  competitive  nature  of  the  algorithm  is  evident  in  (6.18),  where  the  total  weight 
update  is  given  by  the  product  of  zk  with  the  probability  of  the  k expert,  gk,  and  another 
term  which  depends  on  the  inverse  of  the  mean  square  error  of  the  kth  expert.  Thus,  the 
expert  with  the  smallest  mean  square  error  will  have  the  largest  weight  update.  Further- 
more, (6.18)  also  shows  how  the  annealing  is  coupled  with  the  memory  depth.  For  >>0.5, 
the  term  in  brackets  is  always  positive,  and  thus  all  the  experts  move  towards  the  data  to 
improve  their  predictions.  However,  For  >,<0.5,  the  sign  of  the  term  in  brackets  can  be 
either  positive  or  negative,  depending  on  whether  the  mean  square  error  of  the  kth  predic- 
tor is  less  or  greater  than,  respectively,  the  total  cost.  Thus,  experts  that  perform  poorly  can 
actually  be  pushed  away  from  the  data,  although  at  a small  learning  rate  due  to  the  gating 
function. 

6.4.3  Simulation 

To  test  the  algorithm,  we  used  the  same  switching  FIR  process  as  in  the  previous  sim- 
ulation. We  also  used  the  same  number  and  type  of  predictors.  The  initial  value  of  >.  was  1, 
and  after  30  iterations,  it  had  evolved  to  a value  of  0.6,  corresponding  to  a memory  depth 
of  approximately  1.6  samples.  The  winning  predictors  after  training  are  shown  in  Figure 
6-5  (b). 

The  time  series  was  partially  segmented,  except  there  is  more  spurious  switching  than 
there  was  for  the  Neighborhood  Map  of  Predictors.  This  is  because  >.  did  not  adapt  to  a 
small  enough  value  to  invoke  hard  competition,  whereas  in  the  previous  algorithm,  it  was 
experimentally  set  to  a value  which  gave  a stable  segmentation.  Still,  the  results  are 


encouraging. 
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Figure  6-5.  Switching  FIR  process:  (a)  time  series;  (b)  winning  predictors  after  training. 


6.5  Adding  Static  Memory  to  the  Mixture  of  Experts 
We  now  seek  to  add  memory  to  the  experts’  pdf’s.  We  will  do  this  for  the  Mixture  of 
Experts  model,  which  was  explained  in  Chapter  3.  The  key  point  in  advancing  this  theory 
is  to  regard  the  performance  of  the  experts  over  some  time  window,  v,  as  a single  event 
characterized  by  their  local  root  mean  square  error.  Furthermore,  the  gate  sees  all  the  data 
that  was  presented  to  the  experts  over  this  time  window,  and  produces  a single  output,  rep- 
resenting the  apriori  probabilities  of  the  various  experts  for  this  single  event.  In  this  way, 
the  gate  makes  its  decision  based  on  all  the  data  that  the  experts  have  processed. 

This  is  fundamentally  different  from  viewing  a set  of  expert  predictions  over  some 
time  window  as  an  ensemble.  There,  the  total  probability  of  the  ensemble  is  the  product  of 
the  individual  probabilities,  as  calculated  through  the  likelihood  product  of  Gaussian  prob- 
abilities. Here,  we  determine  the  probability  as  calculated  by  the  root  mean  square  error 
performance  based  on  a chi-square  distribution. 
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6.5.1  Adding  Static  Memory  to  the  Experts’  pdf 

The  performance  of  each  expert  will  be  modeled  by  its  local  root  mean  square  error 
over  time.  If  the  instantaneous  error  is  Gaussian  distributed,  then  the  root  mean  square 
error  is  chi-square  distributed.  The  local  root  mean  square  error  of  the  i expert  is 


x*(  o = -vMo 

(6.22) 

V — 1 

= l X e*('-x) 

(6.23) 

T = 0 

where  ek  is  the  instantaneous  error  of  the  kth  expert,  and  v is  the  memory  depth. 

6.5.2  Chi-Square  Distribution 

Dropping  reference  to  the  particular  expert  momentarily,  if  e is  Gaussian  distributed, 
then  x is  distributed  according  to  the  chi-square  distribution 

(v_n  _vx2 

i 2)  —2 

p(x)  = e (6-24) 

where  K is  a normalizing  constant  given  by 


K = Gamma 


(6.25) 


The  expected  value  of  the  of  the  mean  square  error  is  the  Gaussian  distribution’s  variance 


£[x2]  = = o1. 


(6.26) 


6.5.3  Gradient  Descent 

The  likelihood  is  now  formed  from  windows  of  data  of  size  v,  in  order  to  ensure  statis- 


tical independence 
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N K 

C = -lnl  = ^ -In  ^ hk(n)p(d(n)\x(n),  k)  . (6.27) 

n = v,  2v...  A:  = 1 

We  perform  gradient  descent  on  (6.27),  except  now  the  expert’s  pdf’s  are  chi-square 
distributions.  First,  we  need  the  partial  derivative  of  the  expert’s  chi-square  distributions 
with  respect  to  some  free  parameter,  in  the  kth  expert 


dPk 

dw. 


1 

v 1 


“k  a i 


(6.28) 


x = 0 


We  then  plug  (6.28)  into  (3.20)  to  obtain 


N 


dC 

dwL 


= ZW 


t = v,  2v. 


i-i- 

V 

e/t(0 


rv-1 


(6.29) 


Lx  = o 


Note  that  the  sum  within  the  right  most  brackets  represents  the  total  weight  change  over 
the  time  window  for  an  expert  operating  independently.  Also,  the  reason  for  modeling  the 
root  mean  square  error  (as  opposed  to  the  mean  square  error)  becomes  clear  once  it  is  seen 
that  (6.29)  reduces  to  the  equivalent  Mixture  of  Experts  equation  (3.20)  when  v=l. 

There  is  still  an  analytical  solution  for  the  variances,  which  is  now  expressed  in  terms 
of  the  expert’s  local  mean  square  error 


6C 

d°k 


N 

X A*(0e*(  0 


X W 

t = v,  2v... 


(6.30) 
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Based  on  (6.26),  the  winning  expert  will  be  the  one  whose  local  mean  square  error  best 
matches  its  variance,  in  which  case  the  algorithm  reduces  to  a block  form  of  the  ME  algo- 
rithm 


N 


£ks>ak 


dC 

dWr 


hk(  0: 


t = V,  2v... 


V-  I 


.x  = 0 


(6.31) 


It  should  be  noted  that  there  is  no  change  to  the  gradient  equation  for  the  gate,  other  than 
that  the  posterior  probabilities  are  calculated  using  the  chi-square  distribution. 

6.5.4  Simulation 

We  tested  both  the  ME  model  and  our  ME  with  memory  model  using  a simple  toy  time 
series  with  two  regimes.  The  time  series  consisted  of  a switching  AR(1)  process  driven  by 
white  noise.  Each  regime  consisted  of  200  samples  and  appears  only  once,  for  a total  time 
series  length  of  400  samples.  Each  model  employed  two  MA(1)  linear  predictors  as 
experts,  and  used  a two  hidden  layer  MLP  as  the  gate.  While  the  two  experts  made  their 
predictions  based  on  the  previous  value  only,  the  gate  used  the  previous  ten  values,  result- 
ing in  a gate  architecture  of  10-4-3-2.. 

The  results  for  the  ME  model  are  shown  in  Figure  6-6,  while  those  for  our  ME  with 
memory  model  are  shown  in  Figure  6-7.  The  solid  lines  represent  expert  I,  while  the 
dashed  lines  represent  expert  II.  Note  that  the  time  scale  in  Figure  6-7  is  ten  times  that  of 
Figure  6-6,  representing  the  fact  that  our  gate  fires  only  once  every  10  samples. 


A very  simple  time  series  was  chosen  to  emphasize  that  the  classical  ME  model  makes 
frequent  spurious  switching  errors  even  when  the  regimes  are  clearly  visible  to  the  eye,  as 
can  be  seen  in  Figure  6-6.  However,  in  our  model,  only  one  switching  error  occurs.  Fur- 
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squared  error;  (c)  pdf;  (d)  gate;  (e)  posterior  probability. 


thermore,  we  see  that  the  transition  between  regimes  as  indicated  by  the  gate  occurs 
instantaneously. 

It  is  natural  to  ask  whether  the  gate  output  of  Figure  6-6  could  be  “filtered  and  down- 
sampled”,  in  a probabilistic  sense,  to  improve  the  segmentation.  Such  a technique  is 
fraught  with  difficulties,  however,  because  probabilities  are  multiplicative.  A very  small 
probability  surrounded  in  time  by  unity  probabilities  still  results  in  a very  small  probabil- 
ity when  the  events  are  grouped  as  an  ensemble.  Our  algorithm  is  not  subject  to  this, 
because  the  gate  learns  to  produce  a single  output  that  is  the  result  of  the  local  perfor- 


mance of  the  experts  over  time. 
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Figure  6-7.  MOE  with  memory  simulation  (Expert  I (solid)  and  Expert  II  (dashed)):  (a) 
input  signal;  (b)  squared  error;  (c)  pdf;  (d)  gate;  (e)  posterior  probability. 


6.6  Adding  Recursive  Memory  to  the  Mixture  of  Experts 
Again,  we  model  the  root  mean  square  error  as  in  (6.22),  except  now  we  use  the  recur- 
sive estimate  of  (6.3),  both  repeated  here  for  convenience 

x*(o  = 7*50  (6-32) 

ek(n)  = Xe2k(n)  + (1  — X)ek(n  — 1)  0 < X < 1 . (6.33) 

6.6.1  Gamma  Distribution  of  the  Recursive  Mean  Square  Error 

Dropping  reference  to  the  particular  expert  momentarily,  under  the  assumption  that  the 
instantaneous  error  is  gaussian  distributed,  the  exact  distribution  of  % is  unknown.  How- 
ever, we  propose  to  try  and  fit  a gamma  distribution: 
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ZL 

/ \ _ 1 a P 
P(l)  = e 


(6.34) 


(a+i) 


'a  + r 

~T~ 


(6.35) 


In  order  to  evaluate  the  coefficients  a and  p,  we  take  the  2nd  and  4th  moments  of  both 
the  probability  distribution  and  the  root  mean  square  error  estimator,  and  equate  them. 
From  the  probability  distribution: 

00 

E[l]  = = (a  +2‘  (6.36) 

o 


00  2 

cr  4i  f 4 , s,  (a  + 1 )(a  + 3)P  ,,  i'i\ 

E[%  ] = J XP(l)dx  = ^ — • (6-37) 

o 

We  now  evalute  the  expectations  using  the  recursive  mean  square  error  estimator.  From 
(6.32), 


From  (6.33) 

£[X2(01  = £[8(01  and£[x4(01  = £[e2(01- 

(6.38) 

£[£(/)]  = X£[e2(!)]  + (1  ->.)£[£(/- 1)] 

(6.39) 

E[c\t )]  = X2£[e4(0]  + 2X(l-X)£[e2(/)]^[e(/-l)]  + (l->t)2^[£2(?-l)] 
where  e2  is  distributed  according  to  a first  order  chi-square  distribution,  so  that 

(6.40) 

E[e\t)]  = c2 

(6.41) 

E[e'(t)\  = 3c4. 

(6.42) 

Inserting  (6.41)  into  (6.39)  and  (6.40),  and  (6.42)  into  (6.40),  and  simplifying: 
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£[<=«]  = a2 
2,  , 2 + A.  4 

£[e  (01  2^1°  ‘ 

Equating  (6.36)  and  (6.43),  as  well  as  (6.37)  and  (6.44), 

(a+l)p  2 
2^  = G 


(6.43) 

(6.44) 


(6.45) 


(a  + 1 )(a  + 3)p2  _ 2 + A,  4 
4 2-X° 


(6.46) 


and  solving  for  a and  P 


2(1 -X) 


The  desired  probability  distribution  is  thus 


P = 


2 Xa 
2-X 


(6.47) 


2(1  -X)  dlzlly2 
. . 1 X.  2Xo2 

P(X)  = * 


(6.48) 


K = r 


A 


Y>_i 

'X  2 


11 

VA.  2J 


fl-I 

U 2. 


(6.49) 


In  order  to  check  the  quality  of  the  approximation,  the  difference  between  the  theoreti- 
cal and  experimental  distributions,  based  on  filtering  100,000  points,  is  shown  in  Figure 
6.8. 


Having  determined  the  parameters  of  the  gamma  distribution  from  the  2nd  and  4th 
moments,  it  is  of  interest  to  see  how  the  1st  moment  compares  to  the  experimental  data. 
Figure  6-9  below  show  the  theoretical  and  experimental  first  moment  for  various  memory 
depths. 
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Figure  6-8.  Theoretical  (line)  vs.  experimental  (histogram)  % distribution  from  A=1  (top 
panel)  to  A=0.1  (bottom  panel)  in  increments  of  1/10. 


Figure  6-9.  Theoretical  mean  (solid)  and  experimental  mean  (dotted)  vs.  X. 


Ill 


6.6.2  Gradient  Descent 

We  now  proceed  to  find  the  gradient  descent  equations  for  the  Mixture  of  Experts 
when  the  pdf’s  are  given  by  (6.48).  Once  again,  there  is  no  change  to  the  gradient  equation 
for  the  gate,  other  than  that  the  posterior  probabilities  are  calculated  using  the  gamma  dis- 
tribution. However,  the  update  of  the  weights  of  the  experts  is  different.  We  first  caclulate 


dPk 

dwk 


2(1  -h) 

ek 


(2-lk) 

2 

°k 


(6.50) 


**(  0 = + ( 1 - xk)z> 1 ) 


(6.51) 


Plugging  (6.50)  into  (3.20),  we  obtain 


N 


dC 

dwi 


- 1 

t=  l 


\2-Xk)  2(1  -Xk) 


hkzk 


(6.52) 


The  equation  for  the  variances  at  the  end  of  each  epoch  is  the  same  as  in  (6.30). 

There  are  two  ways  that  the  algorithm  reduces  to  the  mixture  of  experts.  When  X=\ 


N 

5C_=  y hkektyk 

dWn  2-1 


t=  1 


cl  dwk 


(6.53) 


which  is  identical  to  the  normal  MOE.  Also,  in  the  steady  state  for  the  winning  predictor 


dC 

dwk 


e*(  0 * 

(6.54) 

N 

v V*(0 

(6.55) 

2—i  2 

t-i  a« 

which  is  the  mixture  of  experts  with  momentum  learning  of  the  winning  expert’s  weights. 
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No  simulations  are  presented  for  this  algorithm,  as  it  performed  poorly  compared  with 
the  MOE  with  static  memory.  It  was  presented,  nonetheless,  since  future  applications  can- 
not necessarily  be  anticipated. 


CHAPTER  7 

CONCLUSIONS  AND  FUTURE  WORK 


7.1  Comparison  of  Present  Work  to  Classical  Segmentation 

We’ve  seen  how  gated  competitive  systems  can  be  used  to  segment  and  model  piece- 
wise  stationary  signals  in  a completely  unsupervised  fashion.  We  now  compare  the  classi- 
cal method  of  generalized  likelihood  ratio  (GLR)  to  the  new  paradigm  of  gated 
competitive  systems,  in  the  context  of  time  series  segmentation.  As  previously  stated,  the 
GLR  algorithm  is  a segmentation  method  that  detects  a transition  between  two  adaptive 
models,  by  considering  the  likelihood  of  all  possible  changepoints  within  the  current  sub- 
set of  data.  Once  a transition  is  detected,  the  algorithm  is  restarted  anew  from  the  change- 
point.  Although  the  GLR  develops  a model  for  the  data  both  before  and  after  a local 
changepoint,  this  information  is  usually  discarded.  Thus,  the  GLR  is  local  in  both  time  and 
model. 

Gated  competitive  experts  (GCE),  on  the  other  hand,  attempt  to  model  all  the  data 
simultaneously  using  a finite  mixture  of  adaptable  models.  It  is  thus  global  in  time  and 
local  in  model.  In  this  sense,  it  differs  from  the  GLR  in  that  a particular  model  (expert)  can 
be  active  over  non-contiguous  regions  of  the  data  set.  Both  approaches  have  advantages 
and  disadvantages,  which  we  will  now  discuss. 

Certainly  the  biggest  advantage  of  GLR  over  GCE  is  that  it  segments  the  data  in  a sin- 
gle pass.  This  does  not  mean  that  the  GLR  is  less  computationally  intensive.  However, 
certain  approximations  to  the  core  GLR  algorithm  can  reduce  the  computational  complex- 
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ity  at  a cost  of  increased  false  and/or  missed  detections.  Because  the  GCE  is  global  in 
time,  it  requires  several  iterations  through  the  entire  data  set.  However,  this  is  also  the  big- 
gest advantage  of  the  GCE  over  the  GLR.  By  being  global  in  time,  the  GCE  takes  advan- 
tage of  redundancies  in  the  data,  in  the  form  of  repeating  regimes,  to  form  better  estimates 
of  the  various  models  that  produced  the  regimes.  However,  because  the  GCE  algorithm 
possesses  many  local  minima,  it  is  not  necessarily  a more  reliable  estimator  of  the  change- 
points  themselves. 

When  a time  series  includes  one  or  more  very  short  stationary  segments,  the  GCE 
approach  is  clearly  superior  because  it  has  a much  smaller  deadzone  for  changepoint 
detection  than  the  GLR.  In  fact,  its  deadzone  is  basically  the  number  of  taps  of  the  tapped 
delay  line  at  the  input,  as  long  as  the  regime  is  redundant  and  occurs  elsewhere  within  the 
time  series.  The  deadzone  of  the  GCE  algorithm  depends  on  the  complexity  of  the  models 
and  the  detection  threshold,  but  it  is  very  unreliable  to  detect  a changepoint  in  a segment 
of  data  less  than,  say,  50  samples,  since  the  “variance”  of  the  autocorrelation  estimate 
increases  with  decreasing  samples  according  to  a Wishart  distribution. 

If  the  time  series  was  produced  by  piecewise  non-linear  dynamical  systems,  then  the 
GCE  approach  is  clearly  superior  to  the  GLR  algorithm.  The  GLR  algorithm  becomes 
intractable  when  the  models  are  non-linear,  because  the  models  must  be  trained  to  comple- 
tion for  each  potential  changepoint  within  each  block.  The  GCE  experts,  even  if  non-lin- 
ear, need  not  be  trained  to  completion  at  each  iteration  for  the  algorithm  to  converge  to  a 
valid  solution.  Also,  the  GCE  algorithm  lends  itself  to  parallelization,  since  each  expert 
can  be  run  on  a separate  machine. 


115 


We  now  ask  the  question  whether  the  GLR  algorithm  could  be  adapted  so  that  it  too 
makes  use  of  non-contiguous  regions  that  may  represent  the  same  dynamical  system.  Cer- 
tainly, the  model  parameters  of  each  stationary  region  encountered  can  be  stored  instead  of 
discarded.  However,  in  order  to  globally  model  the  data  similar  to  the  gated  competitive 
systems  approach,  the  number  of  generated  models  must  be  limited.  Thus,  once  a new 
changepoint  (and  hence  stationary  region)  is  detected,  some  decision  would  have  to  be 
made  as  to  whether  the  current  model  is  sufficiently  “close”  to  one  of  the  stored  models.  If 
the  current  model  is  a sufficient  match  for  a stored  model,  then  the  stored  model’s  parame- 
ters are  updated  to  reflect  the  addition  of  the  new  data.  If  the  current  model  is  sufficiently 
different  from  all  stored  models,  then  a new  reference  model  is  created  and  stored.  For  this 
approach  to  work,  there  must  be  a metric  for  measuring  the  difference  between  the  refer- 
ence model  and  the  current  model.  For  linear  models,  there  are  several  well  known  dis- 
tance metrics.  If  the  models  are  non-linear,  it  is  much  more  difficult  to  define  a distance 
metric.  In  addition,  it  may  not  be  possible  to  update  the  model  parameters  without  re- 
access to  all  prior  regions  of  data  that  were  assigned  to  the  model. 

1.2  Comparison  of  Input  and  Output  Based  Competitive  Systems 
Recall  that  an  input  based  gating  system,  such  as  the  Mixture  of  Experts,  uses  a neural 
network  that  learns  to  mediate  which  expert  is  valid  based  only  on  the  input  data.  On  the 
other  hand,  output  based  gating  systems,  such  as  the  Annealed  Competition  of  Experts,  the 
Neighborhood  Map  of  Competing  Predictors,  and  Self-Annealing  Competitive  Prediction, 
continually  reassess  which  expert  is  valid  based  on  their  recent  performance. 

There  is  a significant  difference  between  input  and  output  based  gating  when  the 
experts  are  predictors.  Because  output  based  gating  always  requires  a performance  mea- 
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sure  to  assign  validity  to  the  experts,  it  cannot  be  used  for  multi-step  prediction.  This 
means  that  it  is  restricted  to  use  as  an  analysis  tool.  However,  this  is  usually  not  a serious 
restriction  because  multi-step  prediction  requires  a higher  level  modeling  of  the  transitions 
between  stationary  regimes.  If  an  input  based  gate  system  is  performing  multi-step  predic- 
tion and  makes  an  error  in  expert  assignment,  the  prediction  will  rapidly  diverge. 

Higher  level  modeling  of  the  transitions  between  stationary  regimes  is  an  area  of 
research  that  has  only  just  begun  to  be  explored,  and  could  eventually  lead  to  speech  rec- 
ognition systems  that  completely  self-organize  around  a few  minutes  of  spoken  text. 

7.3  Competitive  PCA 

Competitive  PCA  is  a valid  alternative  to  competitive  prediction  for  the  segmentation 
problem.  The  segmentation  can  be  provided  either  by  the  gate  or  the  posterior  probabili- 
ties. Recall  that  the  gate  is  a neural  network  whose  input  is  the  original  signal  and  whose 
ith  output  is  an  estimate  of  the  prior  probability  of  the  ith  PCA  expert.  The  posterior  proba- 
bility of  the  ith  PCA  expert  is  the  gate  output  multiplied  by  the  pdf  of  the  ith  PCA  expert 
evaluated  at  the  local  reconstruction  error.  During  training,  the  posterior  probabilities  act 
as  targets  for  the  gate.  However,  the  role  of  the  gate  is  not  to  merely  memorize  the  poste- 
rior probabilities,  but  rather  to  smooth  them.  Once  trained,  the  output  of  the  gate  alone  is 
capable  of  segmenting  a new  signal.  Unfortunately,  because  the  gate  is  a neural  network, 
generalization  performance  can  vary  widely  depending  on  the  number  of  adaptable 
parameters,  the  quality  of  the  training  run,  etc.  However,  even  if  the  gate  is  not  generaliz- 
ing well,  the  posterior  probabilities  can  be  calculated,  and  these  used  to  segment  the  sig- 
nal, at  the  cost  of  a noisier  segmentation. 
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7.3.1  Application  to  Images 

For  images,  two-dimensional  prediction  is  not  well-defined,  and  thus  PCA  becomes  a 
natural  analysis  tool.  In  chapter  four,  we  saw  that  competitive  PCA  can  be  used  to  segment 
textures  within  an  image  in  a completely  unsupervised  fashion.  The  technique  is  limited  in 
a practical  sense  to  textures  that  are  defined  over  short  distance  scales.  Also,  autocorrela- 
tion is  known  to  be  an  incomplete  description  of  textures. 

PCA  is  sensitive  to  the  orientation  of  textures  with  directional  energy.  Thus,  while 
humans  naturally  recognize  rotations  of  a texture  as  belonging  to  the  same  class,  such 
rotations  equally  affect  the  PCA  representation  (masks)  and  cause  performance  degrada- 
tion when  analyzing  new  test  images  with  rotated  textures.  One  way  around  this  would  be 
to  re-train  the  gate,  after  the  basic  algorithm  has  converged,  with  rotated  versions  (90,  1 80, 
and  270  degrees)  of  the  training  image,  keeping  the  desired  response  the  same. 

7.3.2  Application  to  Time  Series 

When  the  system  input  is  delayed  versions  of  a time  series,  the  same  competitive  PCA 
system  can  be  used  to  segment  time  signals.  We  saw  how  temporal  PCA  picks  out  peaks  in 
the  signal  spectrum  as  a matched  filter.  This  makes  temporal  PCA  ideally  suited  to  seg- 
menting signals  dominated  by  a few  strong  linear  modes.  Temporal  PCA  also  is  the  opti- 
mal signal  to  noise  projection  for  additive  white  noise  and  the  optimal  linear  encoder. 

7.4  Temporal  PCA  and  the  Discrete  Prolate  Spheroids 

We  saw  how  the  discrete  prolate  spheroids  are  a natural  basis  for  the  eigenvectors  of 
the  autocorrelation  matrix  of  a time  series  (temporal  PCA).  Intuition  says  that  it  should  be 
possible  to  perform  faster  on-line  adaptation  of  the  principal  components  using  the  dis- 
crete prolate  spheroid  sequences  (DPSS)  as  a basis.  Only  two  parameters  would  need  to  be 
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adapted  for  the  first  and  largest  eigenvectors:  the  center  frequency  and  amplitude  of  the 
DPSS.  However,  the  basis  is  still  complete  for  all  center  frequencies,  and  thus  gradient 
descent  with  respect  to  that  parameter  cannot  be  used.  Some  other  criteria,  such  as  the 
Frobenius  norm  of  the  expansion  coefficients  is  required. 


APPENDIX 

PROPERTIES  OF  THE  DISCRETE  PROLATE  SPHEROIDS 


We  now  present  some  of  the  properties  of  Slepian’s  (1978)  discrete  prolate  spheroid 
sequences  (DPSS)  and  wave  functions  (DPSWF).  In  the  time  domain,  the  index  limited 
DPSS  are  the  eigenvectors  of 

Rvk  = Akvk  k = 0...N—  1 (A.l) 

when  the  components  of  R are  given  by  the  sine  function 

R mri  - r(m  — n ) = sin[27i  W(m  — n)]/(m  — n) . (A.2) 


The  index  limited  DPSS  are  alternately  symmetric  and  anti-symmetric 
vk(n ) “ vk(N—l—n)  k = even 

(A.3) 

= —vk(N—  1 —n)  k = odd 

The  DTFT  of  the  eigenvalue  equation  (A.l)  defines  the  discrete  prolate  spheroidal 
wave  functions  (DPSWF) 
w 

J Uk(f’' N’  'r)df  = Ak(N'  ^ Ul(/' N'  ^ (A'4) 

-w 

where  the  DPSWF  are  related  to  the  index  limited  DPSS  through  the  DTFT 

N-  1 

Uk(f)  = H 2 vt(n)ennA-kN-')n]  (A.5) 

n = 0 


and  where  ek  is  a constant  (1  or  i)  defined  to  ensure  that  the  DPSWF  are  real.  The  DPSWF 
are  doubly  orthogonal 
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W 1/2 

A ('nW)  J Um(f,  N,  W)U,(f,  N,  W)df  = J UJf,  N,  W)U,(f,  N,  W)df  = 8„„ . (A.6) 

—W  -1/2 

Because  of  this  double  orthogonality  property,  the  inverse  DTFT  is  expressible  over  two 
ranges 


1/2 


vk(n)  = - J Uk(f,N,W)ei2njln  (N  X)/1]df 


-1/2 


1 


W 


(A. 7) 


k(N,  W) 


j Uk(f  N,  W)e 


-i2n/[n-(N-\)/2] 


df 


-w 


The  Dirichlet  kernel  is  expressible  in  terms  of  the  DPSWF 


N-  1 

= ^ N’  ^ u,(r  ~u  N'  ^ ' (A'8) 
« = 0 

where /0  is  an  arbitrary  frequency  offset. 

Even  though  there  is  no  general  analytical  expression  for  the  DPSWF,  the  DPSS  and 
DPSWF  have  a number  of  interesting  properties.  Of  all  index  limited  sequences  of  length 
N,  the  index  limited  DPSS  have  the  greatest  concentration  of  energy  in  the  band  \f\<W, 
subject  to  orthogonality  constraints.  The  eigenvalues  represent  the  fraction  of  energy 
within  the  band.  Since  the  DPSS  and  DPSWF  are  still  subject  to  the  so  called  time  band- 
width product  law,  the  first  2 NW  eigenvalues  are  close  to  one,  with  the  remaining  eigenval- 
ues approaching  zero. 

If  C/Af.](4Ar,IF)  has  the  smallest  possible  energy  concentration  within  \f\<W,  then  it  has 
the  largest  for  j f\>W,  and  thus  UNA( Zi-fN, Vi- W)=Uq(J,N, W).  This  generalizes  to  all  pairs 


of  DPSWF ’s  as  follows 
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Uk(f,  N,W)  = C/«_ , -/  JV,  | - W')  • (A-9) 

As  W-> 0,  the  kth  DPSWF  evaluated  at  fW  becomes  the  kth  Legendre  polynomial. 
Using  (A.9),  one  can  also  show  that  as  W—>/t,  the  DPSWF  evaluated  at  JW  are  the  Leg- 
endre polynomials. 


The  following  figures  demonstrate  some  of  the  properties  of  the  DPSWF  for  the  case 
when  N=7,  the  largest  value  for  which  the  software  program  Mathematica  v3.0  can  find  a 
symbolic  solution  to  the  time  domain  eigenvalue  equation. 


Figure  A-l.  Eigenvalues  versus  half-bandwidth,  W,  for  &=0...7. 
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Figure  A-2.  U(ff,l,W)  versus/for  fV=0.05,  0.1,  0.15,  0.2,  0.25,  0.3,  0.35,  and  0.4. 
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Figure  A-3.  (a)  Utf, 7,0.1)  and  (b)  lfk(f,  7,  0.1 ) versus/for  k=  0...7 
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